USP1100

The USPs went in beautifully and set up just like all of our past HDS boxes. We uninstalled an old 7700E disk subsystem and put the USP1100 in its place. We were able to reuse the power from the 7700E for the 1100, and all we had to run was new fibre, an Ethernet connection, and a phone line. Normally we would have had a phone line already there, but we had HiTrack from our xp1024 monitoring the 7700E and xp256 through coax connections, so we only needed one phone line for three boxes. Our HDS CE came up with that and saved us a little money on B1s for the past few years. Nice, that guy is.

The 1100 came with 40 3+1 RAID 5 array groups and we reformatted them to 20 7+1 RAID 5 array groups. We’ve never found a reason to go with RAID 1 on any of our disk subsystems because we have been able to tune and lay out our storage without any performance issues for our applications. We tell our customers that if their performance turns out to be degraded because of a RAID 5-configured array group, then we’ll be happy to set up some RAID 1 array groups and let them rip on that. We just tell them that we know their DB vendor says they need RAID 1, but we know our hardware and it can handle it. (I’ll talk some more about how we set up our hosts with software and hardware striping for better performance in another post.)
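Just to put numbers on why 7+1 was worth the reformat, here’s a quick back-of-the-envelope sketch in Python. The 300 GB drive size is a made-up example, not what the box actually shipped with; only the 3+1 versus 7+1 parity math comes from the setup above.

```python
# Rough parity math behind the reformat: 40 x 3+1 RAID 5 vs 20 x 7+1 RAID 5.
# DRIVE_GB is a made-up example size -- plug in whatever your box actually shipped with.
DRIVE_GB = 300

def raid5_capacity(groups, data_drives, parity_drives=1, drive_gb=DRIVE_GB):
    """Return (raw_gb, usable_gb) for a set of RAID 5 array groups."""
    drives = groups * (data_drives + parity_drives)
    raw = drives * drive_gb
    usable = groups * data_drives * drive_gb
    return raw, usable

before = raid5_capacity(groups=40, data_drives=3)   # 40 x 3+1 = 160 drives
after  = raid5_capacity(groups=20, data_drives=7)   # 20 x 7+1 = 160 drives

print(f"3+1 layout: {before[1]:,} GB usable of {before[0]:,} GB raw ({3/4:.1%})")
print(f"7+1 layout: {after[1]:,} GB usable of {after[0]:,} GB raw ({7/8:.1%})")
# Same 160 drives either way; the wider 7+1 stripe hands back the extra 12.5% that
# parity was eating, at the cost of longer rebuilds when a drive fails.
```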

Once we had all the array groups configured, we set up the box with Virtual Partition Manager. This product allows you to segregate the workload between cache, array groups, and even front-end ports. We decided to go with a single Storage Logical Partition (SLPR) and two Cache Logical Partitions (CLPRs). We segregated the cache for mainframe and open systems to help prevent the cache pollution we had heard about at the Computer Measurement Group (CMG) conference many times before. We don’t know for sure if this is actually a problem, but it sounds technically feasible, and we have the software, so we might as well use it. We ended up putting the open systems in CLPR0 and the mainframe in CLPR1. The reason for this was that as we add storage and cache later, it will automatically go into CLPR0, the open systems partition, which is always our biggest consumer of storage. If we need it for the mainframe, then we can move it non-disruptively from CLPR0 to CLPR1 and keep on trucking. The software is really slick, and if you read the manual, you should be able to handle it. There are no secrets or huge gotchas in the software. Really the only thing to remember is that the mainframe storage MUST be in SLPR0.
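If it helps, here’s how I’d jot that partition layout down as a little sanity-check script. This isn’t any Hitachi API, just the plan above written as plain data, with the one hard rule (mainframe stays in SLPR0) encoded as a check.

```python
# The SLPR/CLPR plan from above written down as data -- purely illustrative,
# not a Hitachi interface. Cache sizes are left out on purpose.
partition_plan = {
    "SLPR0": {
        "CLPR0": {"workload": "open systems"},
        "CLPR1": {"workload": "mainframe"},
    },
}

def check_plan(plan):
    """The one hard rule worth encoding: mainframe storage MUST live in SLPR0."""
    for slpr, clprs in plan.items():
        for clpr, cfg in clprs.items():
            if cfg["workload"] == "mainframe" and slpr != "SLPR0":
                raise ValueError(f"mainframe workload found in {slpr}/{clpr}")
    print("partition plan looks sane")

check_plan(partition_plan)
```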

Next came the external storage implementation, where we connected all of our other disk subsystems to the USP1100 to virtualize them. We connected our cx600, cx700, xp1024, and the AMS200. Again, this was another simple thing to make happen. It’s as simple as selecting the ports you want to connect the other disk subsystems to and then making those ports “external” ports. Once you do that, the port begins emulating Windows 2000 to the other disk subsystems, and you add the USP to the other disk subsystem the same way you would any other Windows 2000 box. Cake! Create a Host Group or Storage Group, assign some LUNs to it, and away you go. I think it took us about an hour to externalize all four disk subsystems once we had everything planned out. One thing you must remember is that when you select a port to become an external port, its paired port becomes an external port as well. For example, when you make port 3C an external port, port 7C becomes an external port too. Just remember that you lose two ports for every “external” port you create.
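For planning which ports you give up, a tiny helper like the one below can keep you honest. Fair warning: the only pairing I can vouch for is the 3C/7C example above, so the “add four to the port number” rule in this sketch is an assumption of mine — check it against your own box’s port map before trusting it.

```python
# Hypothetical external-port planner. ASSUMPTION: the paired port is always
# "port number + 4, same letter", which matches the 3C -> 7C example but may not
# hold for every port on every frame -- verify against your own port map.
def paired_port(port: str) -> str:
    number, letter = int(port[:-1]), port[-1]
    return f"{number + 4}{letter}"

requested_external = ["1C", "3C"]
lost_ports = sorted({p for port in requested_external
                       for p in (port, paired_port(port))})
print(lost_ports)   # ['1C', '3C', '5C', '7C'] -- two ports gone per external port requested
```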

There are some things to think about when you’re doing it. Plan a reboot for each server as you virtualize its storage through the USP. For AIX and MSCS clusters you need to shut them down prior to trying to add their LUNs into the USP. Each of those OSes puts a reserve on the LUNs, and the USP can’t discover them until you shut down the server. For standalone AIX we were able to get it to work without shutting the server down by just varying off the disks, but on an HACMP box you must shut it down. Sun boxes with VxVM seem to definitely need a reboot. We haven’t done any HP-UX boxes yet, but they’re a pain in the arse anyway when you switch a disk port. You can read about the virtualization process below.
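Since the prep step differs by OS, I’ve started keeping the rules somewhere better than my memory. The sketch below is just our experience so far written down as a lookup table; the OS/cluster combinations and the wording are mine, not anything out of the HDS docs.

```python
# Pre-virtualization checklist, based only on what we've run into so far.
PREP_RULES = {
    ("AIX", "standalone"): "varyoffvg the volume groups (a full shutdown also works)",
    ("AIX", "HACMP"):      "shut the node down before the USP tries to discover its LUNs",
    ("Windows", "MSCS"):   "shut the cluster nodes down first (they hold reserves)",
    ("Solaris", "VxVM"):   "plan a reboot",
}

def prep_step(os_name: str, role: str) -> str:
    # Anything we haven't touched yet (HP-UX...) gets the conservative answer.
    return PREP_RULES.get((os_name, role), "no rule yet -- plan a reboot to be safe")

print(prep_step("AIX", "HACMP"))
print(prep_step("HP-UX", "standalone"))
```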

The only problem, and it’s not really a problem, that I can see so far with externalizing/virtualizing your storage is the huge amount of complexity you’re adding when you try to manage which device is attached to which server and which disk subsystem the device actually resides on. You are going to create a lot of spreadsheets and will be editing them quite a bit every time you allocate storage to a host. It’s not as simple as it used to be.
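One thing that has helped a little: keeping the mapping in a plain file that scripts can read instead of a pile of hand-edited spreadsheets. The layout below is hypothetical — the LDEV numbers, host group names, and servers are just examples — but it shows the idea.

```python
# One row per externalized LDEV: which host group/server it's presented to, and which
# disk subsystem it physically lives on. All values here are made-up examples.
import csv

allocations = [
    ("00:10", "HG_aixbox1", "aixbox1", "cx600"),
    ("00:11", "HG_aixbox1", "aixbox1", "cx600"),
    ("01:20", "HG_sunbox1", "sunbox1", "AMS200"),
]

with open("usp_external_allocations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ldev", "host_group", "server", "backing_subsystem"])
    writer.writerows(allocations)

# Later, "where does LDEV 00:10 really live?" is a lookup instead of a spreadsheet hunt.
with open("usp_external_allocations.csv") as f:
    rows = {row["ldev"]: row for row in csv.DictReader(f)}
print(rows["00:10"]["backing_subsystem"])   # cx600
```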

Next came Tiered Storage Manager. This is the software that allows you to seamlessly migrate data from one “tier” of disk to another “tier” of disk. The reason I have the word tier in quotes is that a tier can be anything you want, from as big as a whole disk subsystem down to a single device (LDEV). I have to say: this stuff really works! Not that I was skeptical or anything. =-)

We started off by virtualizing some storage for an AIX box, which was on the cx600. I added a Storage Group (SG) for the USP and then copied the AIX LUNs into the SG. Once the AIX guy was done varying the disks offline, I brought the LUNs into the USP, assigned them to a Control Unit, and gave them some addresses. I then created the Host Group (HG) for the AIX box and assigned the LDEVs to that HG. The last step was to zone the USP to the AIX box and take the cx600 away from it. He varied his disks online and was able to start DB2 and SAP like nothing had changed. Cake, I tell you!

Next came the TSM migration of the disk from the cx600 to the AMS200. We created our “tiers,” one for each disk subsystem, because you have to choose which tier you want to migrate to, and if both disk subsystems are in the same tier, then you can’t choose the one you want. The next step was to create a Migration Group (MG), where you select which devices you want to migrate from and which devices you want to migrate to. For this case we had 16 LUNs that we were migrating.

I have to say that HDS kind of screwed the pair selection up a little. TSM goes ahead and picks 16 target LUNs for us, but they are all in the same RAID group on the AMS. This isn’t good, because we would get better performance by spreading the targets across the four RAID groups we have available instead of stacking them in one. Changing the pairs is a long, drawn-out process of the Java web GUI refreshing, and boy does it take forever to get it all done. It would be much better if I could just type in the device pairs myself, since I already know which devices I want the LUNs to migrate to. They should give us the option to either have it auto-select the pairs or let me input them manually. But I digress…
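Here’s roughly what I wish the pair selection did, sketched out in Python: hand the target LDEVs out round-robin across the AMS RAID groups instead of stacking them all in one. The RAID group names and LDEV IDs are invented for the example; only the “spread 16 targets across four groups” idea comes from what we actually wanted.

```python
# Sketch of a friendlier pair selection: rotate through the AMS RAID groups when
# picking migration targets. Names and IDs below are made up for illustration.
from itertools import cycle

source_luns = [f"SRC:{i:02d}" for i in range(16)]

# Candidate target LDEVs, keyed by which AMS RAID group they live in.
targets_by_raid_group = {
    "RG-0": ["AMS:00", "AMS:01", "AMS:02", "AMS:03"],
    "RG-1": ["AMS:10", "AMS:11", "AMS:12", "AMS:13"],
    "RG-2": ["AMS:20", "AMS:21", "AMS:22", "AMS:23"],
    "RG-3": ["AMS:30", "AMS:31", "AMS:32", "AMS:33"],
}

def pick_pairs(sources, targets_by_group):
    """Pair each source LUN with a target LDEV, rotating RAID groups so no one group gets hot."""
    pools = {rg: iter(ldevs) for rg, ldevs in targets_by_group.items()}
    pairs = []
    for src, rg in zip(sources, cycle(pools)):
        pairs.append((src, rg, next(pools[rg])))
    return pairs

for src, rg, tgt in pick_pairs(source_luns, targets_by_raid_group):
    print(f"{src} -> {tgt}  ({rg})")
```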

I selected the pairs that I wanted to migrate and TSM created a migration task. You have the option to schedule the task or execute it right after TSM creates it. I wanted to execute it myself so TSM just created the task and informed me of the task number. So I went and executed my task and away it went. It completed successfully and the application guys were none the wiser that it happened. No disk errors or anything of that sort on the server. It works as advertised.

If you think about it, the disk vendors have been doing this exact same thing for years with hot sparing. The data is copied from one device to another and the new device retains the old device’s address until it’s spared back. So they are allowing us to hot spare them out ourselves and charging us for it. ;-)

Some notes to keep in mind: TSM will only migrate 8 devices at once within a single migration group, so I watched it do 8 successfully and then do the other 8. The documentation says TSM will do up to 64 devices at a time, so later I created two migration groups, each with 8 devices in it, and it copied all 16 at the same time. Just a little trick you can use to speed up your large data migrations. Also, getting back to the complexity of this capability: when you move a device from one disk subsystem to another, the device address stays the same, so instead of having a defined range of CUs on one disk subsystem, you can have any CU on any disk subsystem, and it could change at any time. Again, this goes back to knowing where a device is and which disk subsystem it’s on. It’s going to be pretty frickin’ complicated.
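The migration-group trick is simple enough to sketch: since one group only runs 8 devices at a time, chunk a big migration into several 8-device groups and kick them off together. The device names below are placeholders.

```python
# Split a big migration into 8-device batches, one per migration group, so they
# all copy concurrently instead of back to back. Device names are placeholders.
def migration_groups(devices, group_size=8):
    """Chunk a device list into migration-group-sized batches."""
    return [devices[i:i + group_size] for i in range(0, len(devices), group_size)]

devices = [f"LDEV:{i:02d}" for i in range(16)]
for n, group in enumerate(migration_groups(devices)):
    print(f"migration group {n}: {group}")
# Two groups of 8 -> both sets copy at the same time instead of one after the other.
```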

I have to say, all in all, everything is working great, as advertised and without any issues. Coming up in the next few weeks are Tuning Manager, ShadowImage, and Universal Replicator. I’ll try to post more as we go along. Let me know if there is something you’d like me to go into more detail on and I’d be happy to.