Events/alarms and hopefully we get the right ones.

To be honest, Storage has always been my second choice of vocation, as my heart has been with Enterprise Management Systems for a long time.  I seem to go through cycles of it along with high-end Solaris/Oracle work.  I actually thought that HSSM/HTnM was a good mix with what I do now.  That was until it was decided that our management station would be Microsoft MOM, together with some dinky (read: cheap and not well supported) management solution that was neither OpenView nor SMARTS InCharge.

Today I was happily zoning away (while surfing for a new job) because I get a lot more to play with this weekend, and I noticed that a zoning change had not been reflected at our remote site.  After a bit of investigation I found that our Cisco ONS had something wrong with it and our second fabric was segmented.  I always thought that we could get away with one fabric as everything is dual-pathed.  Well, only one system did have a problem, and it was non-production, so I thought heck, it can wait.

Sometime later I was told that I had my ISLs back, which is interesting, as our networking people swore blind my switches were at fault.  So I thought I would investigate whether anyone actually knew what impact this had had on our organisation.  Almost no one had a clue what had happened.  Not one event or alarm filtered through to our enterprise management people.  Sure, there were a couple of port failures that appeared in Brocade Fabric Watch, but we get them all the time because we are a Windows shop and every problem means a reboot.  So people don't take any notice of the events.  No event said "THIS PORT IS AN ISL AND IT IS IMPORTANT".
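
For what it is worth, the kind of event I was wishing for is not hard to picture.  Below is a minimal sketch in Python, not anything Fabric Watch, HSSM or MOM actually provides: it assumes you keep your own inventory of which switch ports are ISLs and pass raw port events through it before they reach the management station.  The names PORT_ROLES, PortEvent and enrich, and the switch name remote_sw1, are all made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical inventory mapping (switch name, port number) to a port role.
# In practice this would have to be maintained by whoever owns the fabric.
PORT_ROLES = {
    ("remote_sw1", 0): "ISL",
    ("remote_sw1", 7): "HOST",
}


@dataclass
class PortEvent:
    switch: str
    port: int
    state: str              # e.g. "offline" or "online"
    severity: str = "info"
    message: str = ""


def enrich(event, roles=PORT_ROLES):
    """Attach the port role to an event and escalate offline ISLs."""
    role = roles.get((event.switch, event.port), "UNKNOWN")
    if role == "ISL" and event.state == "offline":
        # An ISL going away can segment the fabric, so make it loud.
        event.severity = "critical"
        event.message = (
            f"{event.switch} port {event.port} is an ISL and it is offline: "
            "the fabric may be segmented"
        )
    else:
        event.message = f"{event.switch} port {event.port} ({role}) is {event.state}"
    return event


if __name__ == "__main__":
    print(enrich(PortEvent("remote_sw1", 0, "offline")))
    print(enrich(PortEvent("remote_sw1", 7, "offline")))
```

The point of the sketch is only that a host port bouncing during yet another reboot stays at the default severity, while the same raw event on an ISL gets escalated with a message that someone outside the storage team can understand.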

Without getting into details, all was sort of OK (more like complete ignorance) even though I had a segmented fabric due to zoning mismatches.  So I decided to fix the segmented fabric.  That's where everything went pear-shaped.  There was a zoning transaction in the fabric that stopped the fabric merge.  This was discovered only after the config on the remote switch had been completely removed, or so I thought, and so did the switch.  This issue meant that the switch thought it had no config, so it was a completely stupid switch for about two hours while this config issue was sorted out.  Things started happening in the SAN, mainly due to database servers not really handling one path going missing (note to self: discuss this with the HDLM people).  People started getting upset and a number of sev 1 calls were raised.  So after everything was fixed, we still had no events/alarms that suggested what had gone wrong.  The Brocade switch did say a fabric mismatch had occurred, but just how many people in any organisation know what that means?

So the moral of the story is what?  Ignorance is not bliss.  The most important things will most probably only happen once every few years, and there will be new people when it happens again.  HSSM was of no value whatsoever.  Microsoft MOM... don't get me started on that.

This was my first segmentation exercise ever.  The major fault was that the switch did not tell me about this zoning transaction that was stopping the fabric merge.  I will take that information to my next job.

We as an organisation are used to SAN outages because we still have IBM DS4000 series arrays... but not for much longer.