Storage Russian Roulette

I wonder what everyone's view of Storage Russian Roulette is.  I am absolutely positive we play it every day of the week but just don't think about it all that much.  Over the last few years, I have had my fair share of events that vendors would suggest will (probably) not happen.  I suppose it is risk management at its best, but just how many sales are made on that basis?

What is the chance that both controllers on a midrange storage solution (such as the LSI rebranded arrays) would reset for no apparent reason?  Ouch!!!

What is the chance that a disk failure would cause a midrange controller (such as the Sun SE3510) to fail over?  And Quick I/O hates that!  Grrrr!!!

How many double disk failures are likely to happen on a Sun T300/T3B just because the spare sat there and did nothing for years and when it had to work, it didn't? 

I am sure that no vendor actually talked about those events to the customer, who is more likely some manager with little storage knowledge making the decision because THEY can.

But what are the chances that the so-called bullet proof arrays such as the HDS USP/HP XP/Sun 9990 or EMC DMX will have major failures?  I think I have seen guaranteed uptime of 99.999%.  That's about 5 minutes per year downtime.  Pretty impressive if you ask me, but what is downtime?
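
The arithmetic behind those uptime percentages is easy to check.  A quick sketch in Python (the function name is made up for illustration, and it assumes a 365.25-day year):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year, in minutes, for a given uptime percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for uptime in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(uptime)
    print(f"{uptime}% uptime -> {minutes:.1f} minutes/year")
```

Four nines works out to a bit under an hour a year; it takes five nines to get down to about 5 minutes.  None of which tells you what the vendor actually counts as downtime.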

How does mission critical storage remain mission critical with no downtime?  RAID 5 was popular in pre-USP days along with ShadowImage and TrueCopy (for us HDS people).  That's pretty darn good, isn't it?

Funny that all NetApp Filers are RAID 6 (or at least the ones that I have seen).  But what about your other important data that can take a few hours' outage?  TC/SI needs more storage, and that is just too expensive.  So you look at the big arrays and think, that's good enough for me.  I believe that 99.999% uptime figure on the glossies that are on the web and are handed out at meetings.  We all love that video where HP shoots an XP 12000 (oh what I would do for a gun and that system).  I feel pretty darn good about my decision.  Don't I?

What happens if you are using RAID 5 and one of your Array Groups has a double disk failure?  What if the LUNs in your Array Group with a double disk failure are used to make up a number of LUSE LUNs?  Do you know how long it takes for a USP to format an Array Group?  12 or even 18 hours?  By the time the restore from backup is finished, that's probably 24 hours.  Did I say an hour or two was fine?  24 is not.

Why do some vendors/resellers/consultants insist that RAID 5 is good enough and that RAID 6 is likely to degrade performance, so why go down that path?

That's got to be Russian Roulette in my opinion.  Double disk failure in a RAID 5 Array Group with LDEVs used in many LUSE LUNs.  That's a bullet with your name on it.  You can spin that chamber many times but sooner or later, that's it.  POW!!! BANG!!! ZAP!!  (Where is Batman when you need him?)

I wonder how many companies are passing the gun around?  At least RAID 6 would make the chamber much, much bigger and it may even partially erase your name on that bullet.  Could anyone ever get a triple disk failure?  Probably about the same odds as me winning a big Lottery.
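
To put rough numbers on the size of that chamber, here is a back-of-the-envelope sketch in Python.  Everything in it is an assumption for illustration: the function names, the 8-disk group, the 3% annual failure rate and the 24-hour exposure window are made up, not vendor figures.  It also treats disk failures as independent, which the story of the spare that sat idle for years suggests is optimistic.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # assumption: average year length


def p_disk_fails(afr, window_hours):
    """Chance one disk fails within the window, given an annual failure rate."""
    return 1 - (1 - afr) ** (window_hours / HOURS_PER_YEAR)


def p_group_lost(disks, parity_disks, afr, window_hours):
    """Chance of losing the group after one disk has ALREADY failed.

    The group survives as long as no more than (parity_disks - 1)
    additional disks fail before the exposure window closes.
    """
    remaining = disks - 1
    p = p_disk_fails(afr, window_hours)
    p_survive = sum(
        math.comb(remaining, k) * p**k * (1 - p) ** (remaining - k)
        for k in range(parity_disks)
    )
    return 1 - p_survive


# Hypothetical inputs, not measured values.
DISKS, AFR, WINDOW_HOURS = 8, 0.03, 24

raid5 = p_group_lost(DISKS, 1, AFR, WINDOW_HOURS)
raid6 = p_group_lost(DISKS, 2, AFR, WINDOW_HOURS)
print(f"RAID 5: {raid5:.2e}  RAID 6: {raid6:.2e}  ratio: {raid5 / raid6:.0f}x")
```

With these made-up inputs, RAID 6 comes out a few thousand times less likely to lose the group once the first disk has gone.  The exact ratio matters far less than the order of magnitude, and real disks from the same batch will not be this well behaved.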

Stephen