Enterprise Storage Megathread: Why is my NAS a SAN?


1000101: May 14, 2003; BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

This is my first attempt at a megathread so try not to hate me for it. It was requested by a few people and I decided to take some initiative for once in my life.

What is a SAN?

A SAN is really any network you use to access storage. That storage could be CIFS shares or it could be a fibre channel array.

Some people interchange this term with storage array (including some manufacturers and sales types) which confuses things further.

When I think SAN; I consider the switch fabric that connects everything together; the HBAs on the host side; and of course the targets in whatever form they may be.

Storage!

I do a lot of work specifically with NetApp right now but I've also dealt with compellent, Dot Hill, HDS, and to a lesser degree (becoming much more familiar now). I'm going to talk a bit about NetApp and how many of these concepts relate to NetApp, mostly because my current project is looking at ~40TB of NetApp and ~200+ ESX servers and so my brain is stuck on it.

There are a number of tiers of storage with accepted vendors that I'll outline here. This isn't the definitive list by any means and I'm sure people will come in and say "I'm using vendor X for Y " so don't hang me for it.

High-End:
EMC Symmetrix
HDS AMS1000

Upper-mid:
NetApp 6080
3Par
Pillar Data

Mid-range:
Compellent
EMC Celerra/Clariion
NetApp 2050/3070
HP
IBM
Dell

Low end:
Dot Hill
StoreVault (NetApp)
EqualLogic

NetApp:
I use NetApp mostly because it's one of the most flexible platforms you can buy. It can support fibre channel, iSCSI, NFS, and CIFS with a lot of great software options like snapshots, thin provisioning, de-duplication, and more.

NetApp also uses a very high performance RAID they call 'RAID DP' which is a dual parity algorithm that can sustain 2 drive failures before it dies. NetApp will call a RAID DP array an aggregate. On top of that aggregate sits WAFL, NetApp's most awesome filesystem which is comparable to ZFS. When you create a fibre channel LUN, it's actually a file that sits on that filesystem and it's going to be spread out across every disk in an aggregate.

NetApp is typically favored by a lot of Oracle shops for its superior NFS implementation and it also makes an excellent platform for VMware (it also happens to be supported by Site Recovery Manager.)

Pillar Data:
Pillar Data was born of Larry Ellison saying "I'm tired of NetApp owning all of our customer's NFS business" and dumping 100 million dollars into a new hobby: enterprise storage. The device is supremely over-engineered and eventually I suspect it will cost a lot more than anyone else but it's pretty damned cool.

It supports something they call 'Data QoS' where data with high performance needs is marked and placed on the outside of the spindle where it's spinning the fastest.

In addition to the main storage processors; every shelf has its own RAID controller.

One of my customers is testing this platform and loves its faster-than-NetApp head failover but apparently misses a few NetApp features. I didn't pry because I was there to learn them about VMware.

Compellent:
A pretty hot company in the mid-range market; they sell a lot of features that many would consider magic. Chiefly the automatically tiered data, block level RAID, and the usual suspects: snapshots and clever usage of snapshots (writable snapshots!)

This storage is great because you can buy a shelf of 15k RPM disks and a few shelves of SATA and probably get what can be perceived as tier1 spindle performance. It tracks block access and migrates them around spindles as needed, effectively and automatically archiving older data to SATA disks without additional software. I believe many people here have positive compellent experiences.

My only negative with these guys is that the UI gets kind of pokey on a loaded system and performance with OnStor is pretty abysmal; so we ended up doing a rip-n-replace at a client site with NetApp 3070s.

What features should I look for? What is important?

Since you're consolidating your storage; availability should become the first order of business. You'll typically want 2 storage controllers so when one dies you don't have to stop business.

Performance is typically the next thing to worry about. A number of things will impact this:
1. Drives
2. Cache
3. RAID level
4. Network

Drives are a multi-part equation that relate directly to RAID level and are ultimately the most important thing to worry about. They are commonly referred to as 'spindles' in the storage world and come in many flavors:
SAS
SATA
FC
SCSI
SSD (new with EMC)

And many speeds, commonly: 7200 RPM, 10k RPM, and 15k RPM. A single drive is only capable of so many I/Os per second, or IOPS. There's a formula to calculate this out but a decent rule of thumb is that a single 15k RPM disk can provide 120-150 IOPS. That said; if your application needs 1500 IOPS then you're going to need at least 10 drives to achieve that performance. There's more to it than that; especially when you factor in RAID and cache but that's sort of the basics.
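To put rough numbers on that rule of thumb, here's a quick Python sketch. The function names and the 3.5ms average seek time are assumptions picked for illustration, and the seek-plus-rotational-latency estimate is only a common back-of-the-envelope approximation that ignores RAID, cache, and queueing.

```python
import math


def drives_needed(required_iops, iops_per_drive):
    """Minimum spindle count to hit a target, ignoring RAID and cache."""
    return math.ceil(required_iops / iops_per_drive)


def rough_drive_iops(rpm, avg_seek_ms):
    """Back-of-the-envelope estimate: one I/O costs an average seek plus
    half a platter rotation. avg_seek_ms is an assumed datasheet-style figure."""
    rotational_latency_ms = 0.5 * 60_000 / rpm  # half a revolution, in ms
    return 1000 / (avg_seek_ms + rotational_latency_ms)


# The 1500 IOPS example, at both ends of the 120-150 rule of thumb.
best_case = drives_needed(1500, 150)
worst_case = drives_needed(1500, 120)
print(f"1500 IOPS needs between {best_case} and {worst_case} 15k drives")

# Assumed 3.5ms seek for a 15k RPM drive (hypothetical value).
print(f"estimated IOPS per drive: {rough_drive_iops(15_000, 3.5):.0f}")
```

With that assumed seek time the estimate comes out higher than 120-150, so sizing off the lower rule-of-thumb number is the more conservative choice.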

SSD disks are relatively new and I believe EMC is currently shipping a storage product that makes use of them. Blazingly fast but pretty expensive; they solve many of the latency issues.

Cache is another important piece of the puzzle. Some storage controllers/processors will have upwards of 4GB of cache; which will basically give you a 4GB window before you need to worry about disk performance impacting what you're doing. Some arrays support cache partitioning (HDS) where you can give specific applications a dedicated amount of cache and others just use a whole lot of it.

Most storage arrays support RAID 0, RAID 1, RAID 5, RAID 10 (or 0+1) and some form of RAID 6. Each one has performance and availability limitations that should be considered as you're looking at your rollout.
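As a rough illustration of the performance side of that tradeoff, here's a sketch using the textbook "write penalty" for each RAID level (back-end I/Os per host write). The penalty table is the generic rule of thumb, not any vendor's measured numbers; arrays with large write caches or a filesystem layer like WAFL can behave quite differently. The function name and the 30% write mix are assumptions.

```python
# Textbook back-end I/Os generated per host write (generic rule of thumb).
WRITE_PENALTY = {
    "RAID 0": 1,
    "RAID 1": 2,
    "RAID 10": 2,
    "RAID 5": 4,
    "RAID 6": 6,
}


def effective_iops(drives, iops_per_drive, write_fraction, raid_level):
    """Host-visible IOPS once the RAID write penalty is accounted for."""
    raw = drives * iops_per_drive
    penalty = WRITE_PENALTY[raid_level]
    return raw / ((1 - write_fraction) + write_fraction * penalty)


# 10 drives at 150 IOPS each with an assumed 30% write workload.
for level in ("RAID 10", "RAID 5", "RAID 6"):
    print(level, round(effective_iops(10, 150, 0.30, level)))
```

The same ten spindles give noticeably different host-visible numbers depending on RAID level and write mix, which is why the choice depends on the application.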

The network is just as it sounds. This could be your traditional IP network or it could be a dedicated fibre channel network. Either way; it's the last piece of the puzzle and if you've only got 600Mbps of bandwidth available then that is going to be the best performance you can hope for.

So how do I decide what is best?

It's 100% absolutely dependent on the application you're running. On a typical EMC array I've seen several RAID 10 RAID groups configured alongside a number of RAID 5 RAID groups depending on whether the disks were handed off to a file server or a database server or a web server.

In short; there is no one real answer to this question unless you have a truly unlimited budget. Even then I'd say there is no real answer to this question.

I heard you need fibre channel to have a real SAN? (or I heard fibre channel was best)

PROVE IT! In most environments I've worked on in the last year; iSCSI and NFS were plenty fast for the storage needs. I would say in most cases it's probably best for your business with a few exceptions. If you're in that exception then this thread is probably pretty boring to you in the first place because you already know this poo poo.

If anyone is really interested; I'd be happy to outline an environment in which fibre channel is clearly and absolutely superior to IP storage solutions even factoring in 10gigE.

What is fibre channel anyway?

It's certainly not restricted to fibre optic cables and I can't seem to get people to wrap their minds around that. In fact; Brocade's 256Gbps FC uplinks on the DCX series are copper.

FC is a protocol that is akin to ethernet in its functionality. It's a very low latency protocol that provides a lot of speed. In an FC SAN, a host hands a SCSI command to an HBA which will then pack that SCSI command into an FCP frame which is then sprayed down the wire to a target; which effectively unpacks that FCP frame and executes the SCSI command.

It differs from ethernet in that it will automatically aggregate multiple links without requiring a separate load balancing protocol or having to worry about anything arriving out of order. That means if I've got two switches linked together with 4 8gbps uplinks it's going to 'spray frames' down each link providing ~32gbps of bandwidth.

Another key difference (except on cisco MDS switches) is that every switchport is capable of 100% utilization at the same time.

Some switch vendors would be Brocade, Cisco, and QLogic. I'll post more on these shortly.

Sounds expensive; how bout iSCSI?

iSCSI is great because it's a block level protocol that is carried over traditional IP networks and can typically be managed well enough by IP administrators. In most cases; performance is comparable to fibre channel though there are the occasional instances where it's not going to be suitable. Those will be rare though. To put it in perspective I have two clients off the top of my head running 1500+ user exchange databases on iSCSI in excess of ~800GB, one of which is actually also running a 6 host VMware cluster on the same storage array (EMC celerra if you're curious).

It's anecdotal but I'd challenge anyone to take an FC device and an iSCSI device that are comparable and try to find a performance difference.

Moving on to NAS or in this case; NFS

Many of us had come to hate NFS as most of us knew it well on linux or solaris or in some cases IRIX :angry:

An interesting thing about NFS is that it performs as well as iSCSI in many cases and in fact has the blessing of Oracle to share out very large databases to multiple servers.

NetApp NFS is particularly good (though it's a pay for license unlike iSCSI) and some of the NFS optimizations can sometimes make it outperform iSCSI. Since it's not a block level access protocol; it's ideal for clustered filesystems like vmfs (see VMware) since you won't need to make SCSI reservations any time filesystem metadata needs to change.

Up next; what are these bad rear end features that make these expensive storage arrays worth so god damned much anyway?

edit:

Talking to rage-saq a thought occurred to me so I re-arranged things a bit.

1000101 fucked around with this message at 07:00 on Aug 29, 2008

# ? Aug 29, 2008 06:38

1000101: May 14, 2003; BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

So many of these storage arrays come with things like FlexClone or Data Instant Replay or de-duplication; but what does all this poo poo do and how does it work?

We'll talk about the most common feature available in most storage arrays from just about every vendor first. Mostly because this is the foundation for most other features anyway.

Snapshots!

We've all heard the term and it almost sounds like magic! How the gently caress can I instantly copy 20TB of data by snapshotting it?

Well you technically aren't actually copying anything. There's another pretty standard feature called replication that we'll get to in a bit.

Snapshotting works like this:
You take it and the storage immediately stops writing to that section of the disk; it immediately marks it as read-only and continues writing elsewhere.

What does this mean? If I have 50GB of data and I snapshot it and write another 5GB of data then I have 55GB of data. Now if I delete all of that data and write 45GB of data, I'll actually see 45GB of usage in my operating system but on the storage array I'm going to see 100GB.

Why? because the old data I snapshotted is still physically on the disk; it's just a different part of the disk. I can restore that snapshot and be right back at that point of time with a few mouseclicks in many cases.

What if you just change a file? Let's presume you have a document on your snapshotted volume and you decide to change the font from 'comic sans' to 'arial'. What will happen is the storage will write your change over to a new section of the disk but reference the rest of the data from the snapshot until it changes. If you restore your snapshot and look at your file, you will notice that 'arial' is back to 'comic sans'.

This of course is why in the previous example that we eat up 100GB of disk space.

Think of it as spawning a new timeline of data:

code:

-----my-data----------0  <---- snapshot at the 0
                       \------------------more data-------0
                                                           \-----More data still----->

You might be asking yourself; won't this eat up a lot of disk space? Yes and no. As your data volume grows, yes you will certainly eat up more disk space. The thing you have to consider though is change rate. In most environments you have a very low change rate; probably less than 20% which is why 20% is considered the acceptable "overhead" for snapshots when figuring up storage capacity. i.e. if I need 1TB of usable storage and I want to use snapshots then I should really get 1.2TB of usable storage.

If you think about it; it starts to line up. If you're working on a page in Indesign; it could easily be a 40MB document for example. However you're changing positioning of various elements or fonts or colors. Otherwise minor changes in the grand scheme of things and in the end you might have less than 5MB of actual change on the blocks; so with snapshots your 40MB file eats up 45MB of actual disk space.
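The 40MB/45MB arithmetic can be reproduced with a toy model. This is only a sketch of the general "never overwrite a snapshotted block, write the change somewhere else" idea; `ToyVolume` and its methods are made-up names, one block stands in for 1MB, and no vendor's actual implementation is implied.

```python
class ToyVolume:
    """Toy snapshotting volume: changed blocks are written to new locations."""

    def __init__(self):
        self.physical = {}   # physical block id -> data
        self.active = {}     # logical block number -> physical block id
        self.snapshots = {}  # snapshot name -> frozen copy of the block map
        self._next_id = 0

    def write(self, lbn, data):
        # Always land the write on a fresh physical block.
        pid = self._next_id
        self._next_id += 1
        self.physical[pid] = data
        self.active[lbn] = pid
        self._reclaim()

    def read(self, lbn):
        return self.physical[self.active[lbn]]

    def snapshot(self, name):
        # "Taking" a snapshot only copies the map, never the data.
        self.snapshots[name] = dict(self.active)

    def restore(self, name):
        self.active = dict(self.snapshots[name])
        self._reclaim()

    def _reclaim(self):
        # Free blocks that neither the live volume nor any snapshot references.
        live = set(self.active.values())
        for block_map in self.snapshots.values():
            live.update(block_map.values())
        for pid in list(self.physical):
            if pid not in live:
                del self.physical[pid]

    def host_usage(self):
        return len(self.active)

    def array_usage(self):
        return len(self.physical)


vol = ToyVolume()
for lbn in range(40):                 # a 40MB document
    vol.write(lbn, "comic sans")
vol.snapshot("before-edit")
for lbn in range(5):                  # 5MB of actual change
    vol.write(lbn, "arial")
usage_after_edit = (vol.host_usage(), vol.array_usage())
vol.restore("before-edit")
font_after_restore = vol.read(0)
print(usage_after_edit, font_after_restore)
```

The host still sees a 40-block file after the edit, while the array holds 45 blocks because the snapshot keeps the five old ones alive; restoring drops the changed blocks and the old font comes back.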

So what's the drawback? Disks are cheap, so why the hell not give everyone snapshots?

At the outset, most snapshots are crash consistent. What does that mean? It's in as usable a state as data on a disk where I just yanked the power. Transactions could be half committed or data may not be current because it was still buffered or whatever. This is especially critical on database type applications like Exchange or SQL.

To combat this; a number of things have been done over the course of the years. NetApp for example has released 'SnapManager for Exchange/SQL/VMware' which will actually quiesce your database (essentially letting it know it's going to be snapshotted so it can prepare for it) and make sure the snapshot is stateful and therefore usable when you need to back up to it.

So the next whiz bang thing we'll talk about is replication. Replication will typically take your snapshots and send them off to some other storage array which is probably the same make/model.

Obviously enough, the first sync will take forever since you've got to get all the data over there in the first place. In those cases we tend to replicate the data locally then ship the backup storage off site and send over the deltas.

What's a delta?

Essentially the data from the start of the snapshot until now or the next snapshot. To sum it up further; all my changes.

Once that's done you should be able to figure out what's going on from there. Depending on change rate; you can have remote replication on as slow as a T1 link without negatively impacting your business. Of course; this depends on change rate and frequency of replication. Obviously if you're changing 100MB an hour then you're going to need enough bandwidth to move 100MB an hour (in truth a little faster).
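A quick way to sanity check a link against a change rate is sketched below. The T1 line rate of 1.544 Mbps is the standard figure; the function names and the 25% headroom factor (standing in for "a little faster") are assumptions, and MB here means 10^6 bytes.

```python
T1_MBPS = 1.544  # standard T1 line rate in megabits per second


def required_mbps(change_mb_per_hour, headroom=1.25):
    """Sustained bandwidth needed to ship the deltas as fast as they are made.

    headroom is an assumed fudge factor for protocol overhead and bursts.
    """
    return change_mb_per_hour * 8 / 3600 * headroom


def fits_link(change_mb_per_hour, link_mbps, headroom=1.25):
    return required_mbps(change_mb_per_hour, headroom) <= link_mbps


for rate in (100, 1000):
    print(f"{rate}MB/hour needs {required_mbps(rate):.2f} Mbps, "
          f"fits a T1: {fits_link(rate, T1_MBPS)}")
```

At 100MB an hour a T1 has plenty of room; push the change rate to 1000MB an hour and the same link falls behind under these assumptions.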

Past replication; snapshotting leads into other technologies like cloning. Cloning is not technically replication on most storage arrays and in many cases can be referred to as a 'writable snapshot.' Essentially you take a snapshot as before but you can then clone that snapshot which will effectively create another "timeline"

That clone can then be handed off to another server to do whatever with. A common application would be to clone a production database to hand to dev/QA teams to test an application against real data.

It'd look something like this:

code:

                      /----Production------>
----Base Data--------0
                      \----Development----->

Even if a developer goes apeshit and 'rm -rf's' his database it's not going to hurt the base data; just that developer's "timeline" so to speak.

Interestingly enough; a few vendors have taken this line of thinking and applied it to boot from SAN, VDI, or other nifty things. Imagine handing your 100 linux boxes with red hat ES 5 the same 30GB boot LUN off your storage and the only consumed space is essentially the configuration information for each server. Something like this:

code:

                       /---hostA configs---->
                      /----hostB configs---->
----Base Data--------0-----hostC configs---->
                      \----hostD configs---->
                       \---hostE configs---->

Now we all know that between a number of otherwise identical linux servers; that the hostname and IP settings occupy less than 1MB of storage. Imagine if you only needed to worry about storing the common data once and the changes between the boxes is the only extra overhead? Storage savings ahoy!

The last thing I'll talk about is de-duplication. This is relatively new and some implementations actually tie directly to snapshots (hello flexclones and server instant replay!) and in a way is basically cloning. I'll talk about NetApp's because it's free and one of my customers has a huge hard on for it.

NetApp calls it A-SIS which I believe is 'advanced single instance storage' or some poo poo. What this will do is scan all the blocks on a given volume searching for identical blocks. When it finds them; it replaces all instances of that block except one with a pointer to a real block. This is typically a scheduled task and is pretty CPU intensive. A simple way to look at it is to basically say it is a block level "zip" that uncompresses on the fly as needed. There are performance implications to consider though since now that one block is probably being accessed about a zillion times more frequently.
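The general idea of finding identical blocks and pointing at a single copy can be sketched like this. It is a generic illustration, not NetApp's actual algorithm: using SHA-256 fingerprints to spot duplicates is an assumption, and `deduplicate` / `read_block` are made-up names. A real implementation would likely also compare the blocks byte for byte before sharing them.

```python
import hashlib


def deduplicate(blocks):
    """Store each distinct block once; every logical block becomes a pointer."""
    store = {}     # fingerprint -> the single stored copy of the block
    pointers = []  # one fingerprint per logical block, in order
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # keep only the first copy
        pointers.append(fingerprint)
    return store, pointers


def read_block(store, pointers, index):
    """Reads follow the pointer back to the shared block."""
    return store[pointers[index]]


# Four 4KB logical blocks, three of them identical.
blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
store, pointers = deduplicate(blocks)
print(f"{len(pointers)} logical blocks stored as {len(store)} physical blocks")
```

Every read of a shared block lands on the same stored copy, which is where the access-frequency concern mentioned above comes from.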

edit: my little graphs are hosed up; if anyone wants I'll try to work something up in visio.

1000101 fucked around with this message at 06:41 on Aug 29, 2008

# ? Aug 29, 2008 06:38

1000101: May 14, 2003; BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

So this poo poo all sounds fancy and expensive; how much does it cost man!?!?!

Not as much as it used to!

There are a number of reasons behind this and to sound like a sales snake I'm going to use the term "market drivers"

1. data storage needs are growing for small to mid sized businesses
2. Virtualization!!
3. technology is getting less expensive
4. A lot of manufacturers want your money

What cost 250 grand a few years ago is easily 50 thousand dollars now.

Depending on options you can consolidate all of your storage for <30k fairly easily. I think you can get 4TB of StoreVault (low end NetApp) for ~12-15k if not less.

The benefits of consolidation are pretty clear:

For one; you make your servers a disposable commodity. IBM has a great deal? Use that; get it with two small disks to hold the OS.

Not sure how much storage you need to grow? buy what you need and as you fill it up buy more disks!

The answer to many of your problems becomes 'buy more disks' but now these disks are available for more than just the box you're putting them in.

# ? Aug 29, 2008 06:55

Richard Noggin: Jun 6, 2005; Redneck By Default

Very nice, thanks!

We're going to be moving our server lineup to ESX 3i, and we're trying to determine whether we can get by with DAS or if we need a SAN. We're a small company, but we have a couple DB-centric apps that run. Here's the setup:

1 SBS 2003 box, serving ~5 users
1 Web based management system with a SQL backend
1 Ticketing system with a SQL backend
1 Server 2k3 box as a backup DC

We'd create a new VM for hosting SQL for both the management and ticketing systems, so the load would go down on those VMs but another one would be picking up the slack.

The new server would be a DL360 G5 with dual quad Xeons with 16GB RAM. Our current switch is a POS Linksys 16 port managed GB job - it's fine for what we're doing now, but I'm not sure about iSCSI traffic. Our storage and IO needs aren't going to change a whole lot in the next few years.

The DAS we're looking at is a MSA60 with ~900GB usable space (12 146GB SAS drives in RAID 10). For a SAN, we'd be looking at the MSA 2000 series running on iSCSI, probably with a similar drive setup.

Any thoughts about what would be the best bang for the buck? The SAN alone is $15k+, while DAS + the server comes in at $11k.

edit: we're a HP shop.

Richard Noggin fucked around with this message at 12:51 on Aug 29, 2008

# ? Aug 29, 2008 12:49

lilbean: Oct 2, 2003

LE posted:

Don't forget about Sun!

I was just gonna post this. I ordered a 4540 today (more cores, more RAM, etc). Can't loving wait for it to arrive.

I plan on using it for staging backups to disk first mostly, but also a general NFS storage server.

# ? Aug 29, 2008 15:28

Ray_: Sep 15, 2005; It was like the Colosseum in Rome and we were the Christians." - Bobby Dodd, on playing at LSU's Tiger Stadium

I've been using Datacore's SANMelody software-based SAN lately. It runs on top of a standard Windows Server install. From my testing, it works pretty well. Has anyone else used it?

# ? Aug 29, 2008 15:36

H110Hawk: Dec 28, 2006

lilbean posted:

I was just gonna post this. I ordered a 4540 today (more cores, more RAM, etc). Can't loving wait for it to arrive.

I've started the process of getting a try-and-buy of a 4540+J4000. I'm waiting to see if they've fixed a minor sata driver problem.

http://www.dreamhoststatus.com/index.php?s=file+server+upgrades

# ? Aug 29, 2008 15:50

mats99: Dec 30, 2005; Equifax ?

Ray_ posted:

I've been using Datacore's SANMelody software-based SAN lately. It runs on top of a standard Windows Server install. From my testing, it works pretty well. Has anyone else used it?

Yup, have had it running at a client's site for one year now with 2 file servers, 1 database server and 5 VMware Server (non ESX) hosts connected to it, and it works like a charm. I'm no expert but I can answer questions on it.

# ? Aug 29, 2008 16:15

lilbean: Oct 2, 2003

H110Hawk posted:

I've started the process of getting a try-and-buy of a 4540+J4000. I'm waiting to see if they've fixed a minor sata driver problem.

http://www.dreamhoststatus.com/index.php?s=file+server+upgrades

Hm, haven't heard of that one. Do you have a bug ID for that? Is it a performance related issue?

# ? Aug 29, 2008 16:24

feld: Feb 11, 2008; Out of nowhere its.....

Feldman

Pillar is what you want if you run Oracle

1000101 posted:

High-End:
EMC Symmetrix
HDS AMS1000

Upper-mid:
NetApp 6080
3Par
Pillar Data

I challenge your hierarchy. Pillar has had features nobody else has had when we bought our SAN last year. TWO controllers per each brick, QoS on the data (now by APPLICATION!), ability to choose where on the disk the data is at (inside=slow, middle=medium, outside=faster). That was always known as "short stroking" the disk and until Pillar nobody offered it because when you started doing it you couldn't use the rest of the disk. Pillar gives you access to the whole disk still.

I'm throwing my gloves down and saying Pillar wholeheartedly deserves to be in the High-End.

FFS we paid $40,000 for 2TB. :cry:

feld fucked around with this message at 16:44 on Aug 29, 2008

# ? Aug 29, 2008 16:35

Syano: Jul 13, 2005

EMC Clariion checking in.

I run a single 15 drive DAE populated with 143gig 15k RPM U320 scsi drives over an iSCSI backplane.

I've had to become intimately familiar with this beast along with wonderful technologies such as jumbo MTU, chimney offload/TOE, and raid levels here recently so I'm more than happy to answer any questions I can.

Also, any discussion of enterprise storage is incomplete without involving Raid3. It still has a place where most of your writes will be sequential (hey there big database). We use it for several databases and it outperforms raid5 without the storage overhead of 1+0.

# ? Aug 29, 2008 18:10

Syano: Jul 13, 2005

Richard Noggin posted:

Very nice, thanks!

We're going to be moving our server lineup to ESX 3i, and we're trying to determine whether we can get by with DAS or if we need a SAN. We're a small company, but we have a couple DB-centric apps that run. Here's the setup:

1 SBS 2003 box, serving ~5 users
1 Web based management system with a SQL backend
1 Ticketing system with a SQL backend
1 Server 2k3 box as a backup DC

We'd create a new VM for hosting SQL for both the management and ticketing systems, so the load would go down on those VMs but another one would be picking up the slack.

The new server would be a DL360 G5 with dual quad Xeons with 16GB RAM. Our current switch is a POS Linksys 16 port managed GB job - it's fine for what we're doing now, but I'm not sure about iSCSI traffic. Our storage and IO needs aren't going to change a whole lot in the next few years.

The DAS we're looking at is a MSA60 with ~900GB usable space (12 146GB SAS drives in RAID 10). For a SAN, we'd be looking at the MSA 2000 series running on iSCSI, probably with a similar drive setup.

Any thoughts about what would be the best bang for the buck? The SAN alone is $15k+, while DAS + the server comes in at $11k.

edit: we're a HP shop.

If all else was equal, the best bang for your buck will be the SAN. The direct attached storage will never have the flexibility of the SAN nor the growth potential (saying this with only minimal knowledge of the solution you are looking at)

# ? Aug 29, 2008 18:13

1000101: May 14, 2003; BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY BIRTHDAY FRUITCAKE!

feld posted:

Pillar is what you want if you run Oracle

I challenge your hierarchy. Pillar has had features nobody else has had when we bought our SAN last year. TWO controllers per each brick, QoS on the data (now by APPLICATION!), ability to choose where on the disk the data is at (inside=slow, middle=medium, outside=faster). That was always known as "short stroking" the disk and until Pillar nobody offered it because when you started doing it you couldn't use the rest of the disk. Pillar gives you access to the whole disk still.

I'm throwing my gloves down and saying Pillar wholeheartedly deserves to be in the High-End.

FFS we paid $40,000 for 2TB.

Interestingly enough; Pillar was actually in the high end area but I bumped it down to "upper-mid" to account for the fact that it's relatively new and hasn't been put through the same paces as an HDS or EMC Symmetrix yet.

When I'm not at work I'll post more.

# ? Aug 29, 2008 18:18

Mierdaan: Sep 14, 2004; Pillbug

Thanks for the thread, 1000101.

Anyone out there using Dell's MD3000i box? We're just getting to the point where we need something low-end, due to increasing interest in virtualization and an unwillingness on my part to replace an ancient Proliant ML350 file server with another traditional file server. We don't have anything terribly IOPS-intensive we'd be putting on it; probably just Exchange transaction logs and SQL transaction logs + DB for a 200-person company setup, so I don't think iSCSI performance issues are worth worrying about for us.

It's a 15 spindle box, so we're thinking about carving it up thusly:

1) RAID 10, 4x 300GB 15K RPM SAS drives: SQL DB
2) RAID 10, 4x 146GB 15K RPM SAS drives: SQL transaction logs
3) RAID 1, 2x 73GB 15K RPM SAS drives: Exchange transaction logs
4) RAID 5, 4x 450GB 15K RPM SAS drives: VMimages, light file server use
(with 1 spare drive for whatever RAID set we decide most needs it)

That'd take care of our immediate needs and give us some room to expand with additional MD1000 shelves (2 can be added to the MD3000i, though the IO doesn't scale at all). We're a small shop and have no experience with SANs, so I could definitely use some feedback on this idea.
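For what it's worth, the raw usable capacity of a layout like that can be tallied with a few lines of Python. This is a sketch using the standard capacity rules per RAID level (half the drives for mirrors, one drive of parity for RAID 5, two for RAID 6); `usable_gb` is a made-up helper and the figures ignore formatting overhead, hot spares, and vendor reserved space.

```python
def usable_gb(raid_level, drives, drive_gb):
    """Raw usable capacity of one RAID set, before any formatting overhead."""
    if raid_level == "RAID 0":
        return drives * drive_gb
    if raid_level in ("RAID 1", "RAID 10"):
        return (drives // 2) * drive_gb   # half the drives hold mirror copies
    if raid_level == "RAID 5":
        return (drives - 1) * drive_gb    # one drive's worth of parity
    if raid_level == "RAID 6":
        return (drives - 2) * drive_gb    # two drives' worth of parity
    raise ValueError(f"unsupported RAID level: {raid_level}")


# The four RAID sets from the list above: (purpose, level, drives, GB per drive)
layout = [
    ("SQL DB", "RAID 10", 4, 300),
    ("SQL transaction logs", "RAID 10", 4, 146),
    ("Exchange transaction logs", "RAID 1", 2, 73),
    ("VM images / file server", "RAID 5", 4, 450),
]

capacities = {name: usable_gb(level, n, gb) for name, level, n, gb in layout}
for name, gb in capacities.items():
    print(f"{name}: {gb}GB")
print(f"total: {sum(capacities.values())}GB across 14 drives plus 1 spare")
```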

Mierdaan fucked around with this message at 20:04 on Aug 29, 2008

# ? Aug 29, 2008 19:39

H110Hawk: Dec 28, 2006

lilbean posted:

Hm, haven't heard of that one. Do you have a bug ID for that? Is it a performance related issue?

They seem to not be making a big deal of it. The X4500 shipped with a broken sata driver, which they consider low priority, even though the box is 6x8 sata cards with some cpu and memory stuffed in the back. We had to install IDR137601 (or higher) for Solaris 10u4 to fix it. The thumpers all ship with u3, so first you have to do a very tedious upgrade process.

Sorry, I don't have a bug ID. OpenSolaris suffers as well, google "solaris sata disconnect bug" or "solaris marvell" and you will find some people who hit it. It's pretty much anyone who puts any kind of load on a thumper. Or in my case, 29 thumpers.

# ? Aug 29, 2008 19:56

unknown: Nov 16, 2002; Ain't got no stinking title yet!

How many people are doing boot-from-SAN (ie: no hard drives on the physical server)?

I'm absolutely loving it, you can't beat the fact of screwing up a boot disk, and to fix you simply present + remount it as a regular volume on a different computer to easily fix. All from a 1000 miles away.

# ? Aug 29, 2008 21:40

rage-saq: Mar 21, 2001; Thats so ninja...

unknown posted:

How many people are doing boot-from-SAN (ie: no hard drives on the physical server)?

I'm absolutely loving it, you can't beat the fact of screwing up a boot disk, and to fix you simply present + remount it as a regular volume on a different computer to easily fix. All from a 1000 miles away.

I've done a few good sized network upgrades. At two remote co-located sites we did a blade chassis and a small SAN and did boot-from-SAN. Each site has an extra blade on hand in case a server were to fail so we could remap the LUN and get it back up and running remotely in a short period of time.
We did that with the customer before they had gone through testing and validation of virtualization, which is how all new deployments are done now.

# ? Aug 29, 2008 22:32

lilbean: Oct 2, 2003

H110Hawk posted:

They seem to not be making a big deal of it. The X4500 shipped with a broken sata driver, which they consider low priority, even though the box is 6x8 sata cards with some cpu and memory stuffed in the back. We had to install IDR137601 (or higher) for Solaris 10u4 to fix it. The thumpers all ship with u3, so first you have to do a very tedious upgrade process.

Sorry, I don't have a bug ID. OpenSolaris suffers as well, google "solaris sata disconnect bug" or "solaris marvell" and you will find some people who hit it. It's pretty much anyone who puts any kind of load on a thumper. Or in my case, 29 thumpers.

Jeeeeesus Christ, 29 of them? Nicely done. How do you have your ZPOOLs laid out on them? We're basically going to use ours for a backup disk (with NetBackup 6.5's disk staging setup), so we'll be writing and reading in massive sequential chunks. We plan on benchmarking with different setups like 40 drives in mirrors of 2, raidz and raidz2 vdevs (in different group sizes).

It'll probably take us weeks just to figure out the best layout for our load.

Edit: As for the u3, I'm not too worried about that. We've used LiveUpgrade extensively to move things from Solaris 8 to 10 as well as for patching systems with less downtime, so I imagine our Thumper's system disk will be an SVM mirror across 2 of the physical disks, with a third being reserved for the upgrades.

# ? Aug 30, 2008 00:30

KS: Jun 10, 2003; Outrageous Lumpwad

I'm so glad you posted this. We've been going through a SAN nightmare for the last month and I wanted to make a thread, but the audience in SH/SC for the enterprise level stuff seems to be very limited.

The production SAN I'm dealing with comprises two HP EVA8100s with 168 disks each, two Cisco MDS9509 switches, and a bunch of HP blades with Qlogic cards. It should be massive performance overkill for our needs. One EVA houses a 5 node SLES file server cluster, while the other houses a 6-server ESX cluster, ~10 HP-UX servers, and ~20 windows servers on 3 distinct and relatively equal sized disk groups.

Performance is terrible. On servers with dual 2gig HBAs, sequential read performance maxes out between 70-120MB/sec. The striking thing is that single path performance is equal to or better than the multipath performance. It just seems to be a per-vdisk speed limit.

We hooked up a server using a spare brocade switch to a spare EVA3000 and saw an identical pattern, but much better performance. Linux servers get 200MB/sec multipathing, but a single path transfer is the same speed, and two single path transfers down each HBA can saturate the fiber at 400MB/sec. I've verified using iostat that multipathing is actually working correctly -- each path seems to be capped at exactly 100MB/sec. A windows server using MPIO gets 300MB/sec.

Stuff we've tried:
updated HBA drivers
updated HBA firmware
Emulex HBA
Different SAN switch
Different EVA
Dell server with qlogic 2460s

This EVA is supposed to provide around 2.4GB/sec sustained reads as configured, and we're struggling along with basically single spindle speeds.

The Linux servers are using multipathd, and the path checker is getting errors like SCSI error : <1 0 0 1> return code = 0x20000. I've started to read that this means the fabric is sending RSCNs, and that they could hurt performance fabric-wide. Anyone have more info? Are we messing up our switch config or something? Is there a way to trace what's causing them?

The Cisco switches are using WWPN zoning and have an ISL trunk between them that will be removed soon, as each switch serves a separate fabric now.

I am not the SAN admin, just the guy who discovered the problem. Advice is appreciated!

(1000101, small wonder I was complaining about our ESX performance, huh?)

KS fucked around with this message at 05:36 on Aug 30, 2008

# ? Aug 30, 2008 03:27

Stugazi: Mar 1, 2004; Who me, Bitter?

Good thread.

I've worked on two EMCs and have a little HDS experience. We currently sell StoreVaults (which have now been rolled up into NetApp and renamed the S Family) to clients.

The StoreVaults are really nice boxes for the price. For under $10k you get a 3TB chassis (capable of 12TB max) with 90% of the whizbang features of a FAS2020. Anyone under 200 employees should check them out. I'd be happy to answer questions on them.

# ? Aug 30, 2008 04:42

Stugazi: Mar 1, 2004; Who me, Bitter?

KS posted:

Performance is terrible. On servers with dual 2gig HBAs, sequential read performance maxes out between 70-120MB/sec.

You need to get some professional services people in there to troubleshoot. With that kind of hardware it's pretty clear you should have the budget to make that problem go away.

The cost in lost performance should easily force someone to write a check for an HP/Cisco guy to come out and pimp your rig.

# ? Aug 30, 2008 04:48

KS: Jun 10, 2003; Outrageous Lumpwad

JollyRancher posted:

You need to get some professional services people in there to troubleshoot. With that kind of hardware it's pretty clear you should have the budget to make that problem go away.

It sounds like you have not experienced the joys of working for the government. I'm sure it's in the budget for CY09.

A few of us are interested in a quicker fix, however. It's a long shot, I know.

# ? Aug 30, 2008 05:13

skipdogg: Nov 29, 2004; Resident SRT-4 Expert

JollyRancher posted:

You need to get some professional services people in there to troubleshoot. With that kind of hardware it's pretty clear you should have the budget to make that problem go away.

The cost in lost performance should easily force someone to write a check for an HP/Cisco guy to come out and pimp your rig.

Nah nah, gently caress that, with that kind of kit you call your HP account manager up and say WHAT THE gently caress get someone out here right now. Enterprise IT can be a small world and one or two high level IT guys badmouthing expensive kit gets spread around pretty quick. It's in HP's best interest to take care of you post haste.

:buddy:

Oh you hear so and so over at xyz INC just deployed a couple of EVA8100's?

:v:

Yeah? I was thinking about those, the HP guys keep trying to sell me on them..

:buddy:

Well he says they loving suck and stay away...

:v:

Well I'll let NetApp and EMC bid the job then.

A bad experience with a company can seriously sour a relationship. Hell, we had some issues with Dell servers 5 years ago and we still won't touch them with a 10 foot pole even though the newer PowerEdges are pretty nice. A call to your account manager should help get the ball rolling; if it doesn't, go above them.

edit:

KS posted:

It sounds like you have not experienced the joys of working for the government.

Oh....

# ? Aug 30, 2008 05:16

ExileStrife: Sep 12, 2004; Happy birthday to you!
Happy birthday to you!

I was working on one of the storage teams at a very large company, though not working directly on the technology (mostly just workflow improvements). Never had to get my hands dirty with this stuff, but one of the side projects that I would hear about occasionally was bringing in three DMX 4s to move 1.4 PB onto. Since each DMX 4 can handle 1 PB alone, what kind of factors would drive the decision to get three? Future capacity management seems like an odd answer to me, since the forecast was nowhere near that for the near future because other data centers were going up. Might this be for some kind of redundancy? Is it possible for one of those DMXs to completely fail? Is it seriously like one singular, monster storage array?

# ? Aug 30, 2008 05:20

R-Type: Oct 10, 2005; by FactsAreUseless

Do you know what would be awesome? A good, clear MS iSCSI Initiator MPIO configuration guide for setting up both HA and path aggregation. I'm embarrassed to admit that I don't understand how to clearly establish MPIO across multiple ports between a W2k8 box and an iSCSI SAN like Openfiler. It seems like MS only wants to use one NIC regardless of how many are installed (in my case, the iSCSI LAN is connected to an Intel 1000 PL PCIe dual port NIC).

R-Type fucked around with this message at 16:16 on Aug 30, 2008

# ? Aug 30, 2008 15:25

adorai: Nov 2, 2002; 10/27/04 Never forget; Grimey Drawer

ExileStrife posted:

I was working on one of the storage teams at a very large company, though not working directly on the technology (mostly just workflow improvements). Never had to get my hands dirty with this stuff, but one of the side projects that I would hear about occasionally was bringing in three DMX 4s to move 1.4 PB onto. Since each DMX 4 can handle 1 PB alone, what kind of factors would drive the decision to get three? Future capacity management seems like an odd answer to me, since the forecast was nowhere near that for the near future because other data centers were going up. Might this be for some kind of redundancy? Is it possible for one of those DMXs to completely fail? Is it seriously like one singular, monster storage array?

It was probably for failover. If you have two units with 700TB on each and one fails, the survivor would have to hold 1.4PB, which exceeds the 1PB max. So you add an additional unit, and then you can fail over properly.
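
A rough sketch of that arithmetic, using only the figures from the posts above (1.4 PB total, 1 PB max per array). The function name is made up for illustration, and it ignores everything except raw capacity:

```python
def survives_single_failure(total_pb, arrays, max_per_array_pb):
    """Return True if the remaining arrays can hold all the data after one array fails."""
    if arrays < 2:
        return False  # nothing left to fail over to
    survivors = arrays - 1
    return total_pb <= survivors * max_per_array_pb


# Two arrays: the single survivor would need to hold 1.4 PB, over the 1 PB max.
print(survives_single_failure(1.4, 2, 1.0))
# Three arrays: the two survivors can hold up to 2 PB between them.
print(survives_single_failure(1.4, 3, 1.0))
```

With two arrays the check fails, with three it passes, which is the whole argument for buying the third box.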

# ? Aug 30, 2008 16:05

complex: Sep 16, 2003

R-Type posted:

Do you know what would be awesome? A good, clear MS iSCSI Initiator MPIO configuration guide for setting up both HA and path aggregation. I'm embarrassed to admit that I don't understand how to clearly establish MPIO across multiple ports between a W2k8 box and an iSCSI SAN like Openfiler. It seems like MS only wants to use one NIC regardless of how many are installed (in my case, the iSCSI LAN is connected to an Intel 1000 PL PCIe dual port NIC).

I assume you have installed the MPIO Multipathing Support for ISCSI thinger. http://technet.microsoft.com/en-us/library/cc725907.aspx

# ? Aug 30, 2008 17:26

StabbinHobo: Oct 18, 2002; by Jeffrey of YOSPOS

ok here's one that annoys me.

Can the "tps" field in an iostat -d be reasonably equated to the IOPS measurements you'd see in navianalyzer or 3par service reporter?

edit: same goes for sar -b

StabbinHobo fucked around with this message at 03:20 on Aug 31, 2008

# ? Aug 31, 2008 03:16

oblomov: Jun 20, 2002; Meh... #overrated

complex posted:

I assume you have installed the MPIO Multipathing Support for ISCSI thinger. http://technet.microsoft.com/en-us/library/cc725907.aspx

I have to say that Server 2008 iSCSI and MPIO performance is just plain better, even compared to the 2.07 initiator on Server 2003. Plus, Microsoft is finally supporting dynamic disks. Funny story, DPM 2007 requires dynamic disks (well, unless you really really want to deal with creating all partitions manually). Good going, Microsoft!

# ? Sep 1, 2008 19:21

oblomov: Jun 20, 2002; Meh... #overrated

Also, on VMware and NetApp. RSM is only supported on iSCSI, I believe. Fiber support should be out shortly, but NFS is not in the picture just yet.

# ? Sep 1, 2008 19:23

oblomov: Jun 20, 2002; Meh... #overrated

oblomov posted:

Also, on VMware and NetApp. RSM is only supported on iSCSI, I believe. Fiber support should be out shortly, but NFS is not in the picture just yet.

Oh, and on the subject of NetApp and iSCSI. I have been running a 12K Exchange environment on a clustered NetApp 3020 with 10 shelves of 15K rpm disks, and it's running like a champ. 2 iSCSI connections from each Exchange cluster node (2 Exchange clusters, 4 and 3 nodes) go to a pair of Cisco 3750 switches. Each NetApp head has 6 iSCSI ports, with a failover VIF trunking two aggregation VIFs. I built this setup 2.5 years ago and haven't had any major issues.

# ? Sep 1, 2008 19:28

oblomov: Jun 20, 2002; Meh... #overrated

oblomov posted:

Oh, and on the subject of NetApp and iSCSI. I have been running a 12K Exchange environment on a clustered NetApp 3020 with 10 shelves of 15K rpm disks, and it's running like a champ. 2 iSCSI connections from each Exchange cluster node (2 Exchange clusters, 4 and 3 nodes) go to a pair of Cisco 3750 switches. Each NetApp head has 6 iSCSI ports, with a failover VIF trunking two aggregation VIFs. I built this setup 2.5 years ago and haven't had any major issues.

Although I must say I have been disappointed with NetApp lately. The ability to add shelves and additional fiber loops to clustered heads is quite lacking; you can't do much live. Also, their sales engineers and sales people tend to over-engineer environments quite a bit. You have to call them out on that and negotiate them down.

Btw, my company does have some MD3000i arrays in remote offices and lab environments, and they are great little iSCSI SANs. You can also attach 2 more shelves of MD1000 to each MD3000i array. I am running 3 nodes (4x4 cores each) with about 100 VMs off one of these setups in a lab with no issues. Performance is obviously not the greatest, but it's a lab environment, so it's good enough.

# ? Sep 1, 2008 19:33

H110Hawk: Dec 28, 2006

lilbean posted:

Jeeeeesus Christ, 29 of them? Nicely done.

Thanks.

They're big, loud, HEAVY, NOISY monsters, but if you don't care about power redundancy you can stuff 6 of them in a 120v 60amp rack! Once they're purring along with that IDR they're lots of fun. :3:

quote:

How do you have your ZPOOLs laid out on them? We're basically going to use ours for a backup disk (with NetBackup 6.5's disk staging setup), so we'll be writing and reading in massive sequential chunks. We plan on benchmarking with different setups like 40 drives in mirrors of 2, raidz and raidz2 vdevs (in different group sizes).

Only disks 0 and 1 are bootable (c5t0d0 and c5t4d0), but you are correct, they come in an SVM mirror. It makes upgrades not so scary, since if you totally bone it somewhere, you can revert quickly. The new X4540 seems to be able to boot from flash, which will be quite nice, adding 2 more spindles to the mix.

Right now we're only getting 11TB usable out of a machine, with 5-disk raidz2s and a handful of spare disks.
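
For anyone wanting to play with layouts, here is a back-of-the-envelope sketch of how a number like that falls out. The 48 bays and 2 boot disks come from the posts above; the 6 spares and 500 GB drive size are assumptions picked for illustration, not the actual config, and the function name is made up:

```python
def raidz2_usable_tb(total_disks, boot_disks, spares, group_size, disk_tb):
    """Raw usable capacity of a pool of equal-sized raidz2 vdevs (2 parity disks per vdev)."""
    pool_disks = total_disks - boot_disks - spares
    vdevs = pool_disks // group_size       # leftover disks are simply unused here
    data_disks_per_vdev = group_size - 2   # raidz2 spends two disks' worth on parity
    return vdevs * data_disks_per_vdev * disk_tb


# Assumed layout: 48 bays, 2 boot disks, 6 spares, 5-disk raidz2 groups, 0.5 TB drives.
print(raidz2_usable_tb(48, 2, 6, 5, 0.5))  # 8 vdevs * 3 data disks * 0.5 TB = 12.0
```

12 decimal TB is a bit under 11 TiB before filesystem overhead, so under those assumptions the math lands in the same ballpark. The point is that 5-disk raidz2 gives up 40% of every group to parity.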

Oh, and a stock thumper zpool won't rebuild from a spare, either. It gets to 100% and starts over. Enjoy! :cheers:

# ? Sep 1, 2008 20:10

lilbean: Oct 2, 2003

H110Hawk posted:

Thanks. They're big, loud, HEAVY, NOISY monsters, but if you don't care about power redundancy you can stuff 6 of them in a 120v 60amp rack! Once they're purring along with that IDR they're lots of fun.

Well with only one it shouldn't be too much trouble. As for the weight, well I think I'll make our co-op student rack mount the thing - and take the cost out of his paycheck if he breaks it by dropping it.

quote:

Oh, and a stock thumper zpool won't rebuild from a spare, either. It gets to 100% and starts over. Enjoy!

Yeesh, is that with the unpatched Solaris 10 that comes with it? I'd planned on a fresh install once I get it with the latest ISOs and then patching it.

# ? Sep 1, 2008 20:31

Wedge of Lime: Sep 4, 2003; I lack indie hair superpowers.

H110Hawk posted:

They seem to not be making a big deal of it. The X4500 shipped with a broken SATA driver, which they consider low priority, even though the box is 6x8 SATA cards with some CPU and memory stuffed in the back. We had to install IDR137601 (or higher) for Solaris 10u4 to fix it. The thumpers all ship with u3, so first you have to do a very tedious upgrade process.

Sorry, I don't have a bug ID. OpenSolaris suffers as well, google "solaris sata disconnect bug" or "solaris marvell" and you will find some people who hit it. It's pretty much anyone who puts any kind of load on a thumper. Or in my case, 29 thumpers.

The 'Marvell bugs' have now been fixed in official patches. The following patches:

127128-11 : Solaris 10 U5 Kernel Feature patch
138053-02 : marvell88sx driver patch

supersede any of the older IDRs that may be available through your official support channel, that is: (IDR136658-01, IDR137601-02, IDR137889-04 and T138053-01). Please ensure that the IDR is removed prior to patching, that you read the patch README and patch in single user mode.

That being said, you may hit the following bug after moving to 138053-02..

http://bugs.opensolaris.org/view_bug.do;jsessionid=b3ba1097bf63c68b29efdc3c5c03?bug_id=6723520

If you're running an X4500 I would recommend moving to these over the IDR. Sun does take this issue seriously, it's just that getting this thing fixed has not been easy.

Thank you.

Also, before doing anything with ZFS please read this:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Wedge of Lime fucked around with this message at 20:59 on Sep 1, 2008

# ? Sep 1, 2008 20:47

cypherks: Apr 9, 2003

EMC partner here. I can answer pretty much any question about the CX3 line and probably the CX4.

I do mostly Clariion and VMware installs.

Has anyone played with the new Brocade 8Gb stuff, or FCoE?

# ? Sep 1, 2008 20:53

cypherks: Apr 9, 2003

R-Type posted:

Do you know what would be awesome? A good, clear MS iSCSI Initiator MPIO configuration guide for setting up both HA and path aggregation. I'm embarrassed to admit that I don't understand how to clearly establish MPIO across multiple ports between a W2k8 box and an iSCSI SAN like Openfiler. It seems like MS only wants to use one NIC regardless of how many are installed (in my case, the iSCSI LAN is connected to an Intel 1000 PL PCIe dual port NIC).

Why would you use a software Initiator? iSCSI cards aren't too expensive.

In any event, yes, MS is pretty bleak with their iSCSI documentation.

# ? Sep 1, 2008 21:09

dum2007: Jun 13, 2001; I may be the victim of indigestion, but she is the product of it.

I spent last year working as an AIX / pSeries consultant (woah! I'm certified!?), so I thought I'd chime in with a really cool device I've personally worked with.

Say you have a typical fiber SAN.

code:

Storage Controller A ----== Fiber Channel Switch ==== Hosts
Storage Controller B ---/
...etc

You can login to the management interface of one of your SAN boxen and allocate some disk to your host.

When you RAID 0 two disk drives you get a performance boost, because you're pulling in parallel from two drives, right? The same goes for two FC disk controllers: stripe across them and you get aggregate performance from both.

Wanna do this the cheap way? Use software RAID on the host itself. If your controllers are both very resilient and you're not worried about the risk of one of them going down altogether, you could stripe across them. I wouldn't always recommend this for a business-critical high-availability setup, but you see where the performance would come from.
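
To make the striping idea concrete, here's a minimal sketch of how a RAID 0 layout maps a logical block onto one of several members (disks or, in this case, LUNs from different controllers). The function name and the stripe size are made up for illustration; real volume managers each have their own layout details:

```python
def stripe_location(logical_block, stripe_blocks, members):
    """Map a logical block to (member index, block on that member) for a RAID 0 stripe."""
    stripe_index, offset = divmod(logical_block, stripe_blocks)
    member = stripe_index % members  # stripes rotate round-robin across members
    member_block = (stripe_index // members) * stripe_blocks + offset
    return member, member_block


# Two controllers, 4-block stripes: consecutive stripes alternate between members.
for block in (0, 3, 4, 8, 9):
    print(block, stripe_location(block, 4, 2))
```

A large sequential read touches both members in turn, which is where the aggregate throughput comes from. It is also why losing either member loses the whole volume.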

Now, say you get two brand new controllers that have faster disks in them and you want to migrate the data from one controller to another? Or maybe you need to shrink or grow the volume? Or maybe your controller sucks and only lets you allocate sets of whole disks to a host, and not portions of them? When hours count and you need to maintain your storage environment, this is all time consuming.

Say hello to the IBM SAN Volume Controller:

code:

Storage Controller A ----== Fiber Channel Switch == -SVC Node- == FC Switch === Hosts
Storage Controller B ---/                          \-SVC Node-/

The SVC basically sits between your managed Disks (mDisks) and presents virtual Disks (vDisks) to your hosts. vDisks are very flexible - now you can:

- Migrate a vDisk to some other group of physical mDisks. This is crucial when you can't take your production database down and you need to migrate it to newer, faster disk.

- Create a vDisk which spans eight controllers full of disks for extreme speed via their aggregate throughput. Your host will still see this vDisk as one LUN.

- Use mirroring capabilities to keep a synchronous or asynchronous* mirror of a vDisk at another location (disaster recovery).

Physically, the SVC is a couple of 1U Intel boxes running a proprietary OS configuration that keeps a translation table of blocks. [vDisk "ORACLEDB"@Block1234 -> mDisk "DS4700A"@Block8093]. This translation is extremely fast, and even a 2-node SVC can handle fairly large installations. I think they support up to 8 nodes (4 pairs) and the system holds a world record for disk throughput as long as you back it with appropriate ($$$) storage controllers.
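
A toy sketch of that kind of translation table, just to show why migration is transparent to the host. This is not how the SVC actually implements it; the class, method names, extent sizes and the "DS4700B" target are all made up for illustration. Only the ORACLEDB / DS4700A / block 1234 -> 8093 example comes from the description above:

```python
from bisect import bisect_right


class VDisk:
    """Toy extent map: ranges of virtual blocks -> (mDisk name, starting block)."""

    def __init__(self, name):
        self.name = name
        self.starts = []   # sorted virtual start block of each extent
        self.extents = []  # parallel list of (length, mdisk, mdisk_start)

    def add_extent(self, vstart, length, mdisk, mstart):
        i = bisect_right(self.starts, vstart)
        self.starts.insert(i, vstart)
        self.extents.insert(i, (length, mdisk, mstart))

    def translate(self, vblock):
        """Return (mdisk, block) backing a virtual block, or raise KeyError if unmapped."""
        i = bisect_right(self.starts, vblock) - 1
        if i < 0:
            raise KeyError(vblock)
        length, mdisk, mstart = self.extents[i]
        offset = vblock - self.starts[i]
        if offset >= length:
            raise KeyError(vblock)
        return mdisk, mstart + offset

    def migrate(self, vstart, new_mdisk, new_mstart):
        """Repoint one extent at new backing storage; virtual addresses do not change."""
        i = self.starts.index(vstart)
        length, _, _ = self.extents[i]
        self.extents[i] = (length, new_mdisk, new_mstart)


db = VDisk("ORACLEDB")
db.add_extent(1024, 512, "DS4700A", 7883)
print(db.translate(1234))       # ('DS4700A', 8093)
db.migrate(1024, "DS4700B", 0)  # hypothetical newer controller
print(db.translate(1234))       # ('DS4700B', 210)
```

The host keeps asking for block 1234 of the same LUN the whole time; only the table entry behind it changes. A real migration also has to copy the data before flipping the pointer, which this sketch skips.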

Oh, did I mention the boxes do caching, too? Just an added bonus - but that's why you must install each node with its own UPS.

Example real world application: I worked at a place with a 2 gigabit pipe across a city so their production SAP databases would be live mirrored to their disaster recovery site. I'm told this is one of the few clients that actually got such a Global Mirror working. They used a pair of Cisco FCOE units, two SVC installations and whatever backend disk they had on hand.

Oh, and the appropriate config for an SVC means that you have no single point of failure in the system. There are two nodes, they should be connected to two fiber channel switches. If one node dies, disk operations continue uninterrupted.

The coolest feature - I thought - was recovery mode. If an SVC node's hard disk dies or it can't boot, it will boot up, communicate with the other node over fiber channel, mount an operating system disk from the good node and boot back up. You get some freaky chevrons on the server's LCD panel when this is happening.

Anyway, this is all off the top of my head and I haven't worked with an SVC for a year - although I have a pretty cool certificate from a training course I did. It's definitely my favourite piece of SAN hardware.

...if only it weren't licensed by the TB and ludicrously expensive: Something like $40,000 for a base config + $7000 per terabyte, last I heard.

# ? Sep 1, 2008 21:38

H110Hawk: Dec 28, 2006

lilbean posted:

Well with only one it shouldn't be too much trouble. As for the weight, well I think I'll make our co-op student rack mount the thing - and take the cost out of his paycheck if he breaks it by dropping it.

Hah! I hope your disability insurance is paid up.

quote:

Yeesh, is that with the unpatched Solaris 10 that comes with it? I'd planned on a fresh install once I get it with the latest ISOs and then patching it.

Yup! Thing should ship with a damned working copy of Solaris.

Wedge of Lime posted:

The 'Marvell bugs' have now been fixed in official patches. The following patches:

127128-11 : Solaris 10 U5 Kernel Feature patch
138053-02 : marvell88sx driver patch

If you're running an X4500 I would recommend moving to these over the IDR. Sun does take this issue seriously, it's just that getting this thing fixed has not been easy.

Do you work for Sun? If so, I would like to speak with you privately about this stuff. I've sunk a lot of man hours into this thing trying to patch something with an IDR for U4 of Solaris based on a plan from our gold support contract.

It looks like the marvell patch was just released a month ago. I'll have to ask my sales rep why we weren't notified about it.

quote:

Also, before doing anything with ZFS please read this:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Yes, read this, it is awesome.

# ? Sep 1, 2008 22:11

KS: Jun 10, 2003; Outrageous Lumpwad

Update:

We shut down our entire SAN last weekend and brought up one linux server, then one 2k3 server. Performance was identical to the performance we get in the middle of the day at peak load.

We found some benchmarks here. These are reads in MB/sec over numbers of I/O streams.

This jibes with what we're seeing. How is that single stream performance anywhere near acceptable? I can throw 4 SATA disks in a software RAID-5 and beat that read performance.

What are the strategies, if any, we should be implementing here? Striping volumes across multiple vdisks? Tweaks to increase the number of i/o "streams" per server? How will we ever get acceptable ESX performance?

:smith:

KS fucked around with this message at 16:57 on Sep 2, 2008

# ? Sep 2, 2008 16:53
