Dynamic Provisioning: The 42MB page unravelled
DISCLAIMER: The opinions expressed in this post are just that, my opinions. I do not work for a vendor or a partner and therefore do not speak authoritatively. However, I do have fairly broad experience and knowledge of Hitachi storage.
NOTE: I've put this together quickly as I'm going to be busy for the next little while and my wife is also about to give birth (a couple of days overdue already). So apologies if it doesn't read as smoothly as Harry Potter.
Firstly, a paragraph for the benefit of those who don't already know - Dynamic Provisioning, as implemented by Hitachi (HDP on the HDS USP-V, ThP on the HP XP24000) is based around a basic allocation unit of 42MB. HDS refer to this as a Page. Essentially, any time a host writes to a previously unallocated area of a LUN, the array will allocate a new 42MB page to that LUN. So, as data is written to an HDP LUN, it will be grown in units of 42MB. OK.
Now, unless it's changed, this 42MB is taken from contiguous stripes from a single LDEV in a single Array Group. This basically means that each of these 42MB pages is allocated from a single Array Group (8 disks). Subsequent pages allocated to the LUN will probably be allocated from another Array Group in the Pool. So over time your LUN should have its blocks nicely balanced across all spindles in the Pool. A form of wide striping.
DISCLAIMER: Actually I should point out that I don't work for Hitachi, HDS, HP or Sun. So I'm not an authority on this topic or the USP/XP (see my previous post for where I'm coming from). Despite this it should be an interesting post. Also, microcode changes happen all the time and the guys at the factory don't mail me about them, so tweaks to the algorithm will at some point, if they have not already, render some of the above information out of date.
However, for the purpose of this discussion, all we really need to know from the above paragraph is that space is allocated in units of 42MB called pages. To date this has not changed.
So.......... the USP-V/XP 24000 is pretty flexible and supports the following hardware RAID levels -
- RAID10 2+2
- RAID10 4+4
- RAID5 3+1
- RAID5 7+1
- RAID6 6+2
In addition to the above, we must also consider the ill-named but quite impressive Concatenated Array Group (it is actually a stripe). These allow you to join 2 or 4 RAID5 7+1 Array Groups to get wider back-end striping.
So if we include these then we have an additional two RAID levels to consider -
- RAID5 14+2
- RAID5 28+4
As we can see, the USP-V gives us all three of the popular RAID levels (1, 5 and 6) as well as scope for customisation of each, giving a total of 7 possible RAID configurations.
Now then, track size for OPEN-V volumes on the USP-V is 256K. However, the USP-V RAID controllers write two rows per spindle per stripe. In other words, when writing a full stripe the USP-V will write two tracks per physical spindle before moving on to the next spindle in the RAID set (need diagram). So you could say the effective chunk size per spindle is 512K. With this in mind we can calculate the stripe size for all of our above RAID configurations using the following calculation -
x = a * 512
Where x is the stripe size in K and a is the number of data spindles in the RAID set.
Based on the above calculation, the respective stripe sizes of each of the above RAID configurations are as follows -
- RAID10 2+2 = 1024K
- RAID10 4+4 = 2048K
- RAID5 3+1 = 1536K
- RAID5 7+1 = 3584K
- RAID6 6+2 = 3072K
- RAID5 14+2 = 7168K
- RAID5 28+4 = 14336K
42MB = 43008K
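For anyone who would rather check the list than take my word for it, here is a minimal Python sketch of the x = a * 512 calculation. The names (CHUNK_K, DATA_SPINDLES, stripe_size_k and so on) are my own for illustration and have nothing to do with anything inside the array.

```python
# Effective chunk per spindle: two 256K tracks, as described above.
TRACK_K = 256
CHUNK_K = 2 * TRACK_K  # 512K

# Data spindles per configuration (parity and mirror spindles excluded).
DATA_SPINDLES = {
    "RAID10 2+2": 2,
    "RAID10 4+4": 4,
    "RAID5 3+1": 3,
    "RAID5 7+1": 7,
    "RAID6 6+2": 6,
    "RAID5 14+2": 14,
    "RAID5 28+4": 28,
}


def stripe_size_k(data_spindles):
    """x = a * 512, with x in K and a the number of data spindles."""
    return data_spindles * CHUNK_K


STRIPE_K = {name: stripe_size_k(a) for name, a in DATA_SPINDLES.items()}
PAGE_K = 42 * 1024  # one 42MB page expressed in K

for name, size_k in STRIPE_K.items():
    print(f"{name:<11} = {size_k:>5}K")
print(f"42MB page   = {PAGE_K}K")
```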
Still with me? Fairly boring I know. However.......
Now for the slightly more interesting part (I stress the slightly). You will find that every one of the stripe sizes listed above divides perfectly into 42MB. If you dig further, you will also find that 42MB is the lowest number that all of the above stripe sizes divide perfectly into, without generating a remainder. I'll spare you the spreadsheet because it's too wide to fit comfortably on this screen, but feel free to check it out for yourself.
Then if we factor in other things such as cache slot and segment size as well as pre-fetch (based on tracks (256K) and multiples thereof) they all also divide perfectly into 42MB.
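In place of the spreadsheet, a few lines of Python do the same job. This is only a sketch of the arithmetic, using the stripe sizes in K as listed in this post; the variable names are mine.

```python
from functools import reduce
from math import gcd

# Stripe sizes in K for the seven RAID configurations discussed in this post.
STRIPES_K = [1024, 2048, 1536, 3584, 3072, 7168, 14336]
PAGE_K = 42 * 1024  # 43008K


def lcm(a, b):
    """Lowest common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Smallest size that every stripe size divides into with no remainder.
lowest_common_k = reduce(lcm, STRIPES_K)

# Whole stripes per 42MB page, and any remainder left over.
stripes_per_page = {s: PAGE_K // s for s in STRIPES_K}
remainders = {s: PAGE_K % s for s in STRIPES_K}

print(lowest_common_k)            # 43008, i.e. exactly 42MB
print(stripes_per_page)
print(any(remainders.values()))   # False: nothing left over
```

So a page always holds a whole number of full stripes, anywhere from 42 of them on RAID10 2+2 down to 3 on a RAID5 28+4 concat.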
Also with the USP-V being track centric/cache slot centric (both 256K), it tends to internally map and manage things, such as external storage, in multiples of this track/slot size. Again, divides perfectly.
Interestingly, you can also divide 42 by 2, 3, 6 and 7, and the 84 effective 512K chunks that make up a 42MB page by 2, 3, 4, 6 and 7. These numbers equate to the number of data spindles in all basic supported RAID configurations (not including previously mentioned concats). To be honest, knowing a little about how clever, efficient and thorough the developers in Japan are, I expect that the 42MB Page size maps to a lot more internally. In fact I wouldn't be surprised if the number of screws used to build each internal disk chassis was divisible by 42..... ;-)
Further to this internal mapping... Because HDP borrows a lot from the Hitachi implementation of Copy-On-Write technology (COW), I will refer to the operation of allocating a new 42MB page to a LUN as a Borrow-On-Write, or BOW, operation.
Each time a BOW operation occurs there is overhead along the lines of the below -
- Search the free page table for the next available free page (if there is now logic on top of this to more evenly spread the load then the overhead will be more)
- Update the Dynamic Mapping Table (DMT) and the free page table
- Map the page into the DP-VOL's allocated page table
- Make the blocks available for access
(some of the terms used above are probably my own and not official)
Maybe there's more to the BOW operation, maybe there's less? But it won't be too far from that mentioned above.
So...... with the above in mind, the smaller the page size, the more often these BOW operations will be required when growing a volume. Each one incurs a small overhead, so the less frequently they happen the better. Of course these operations only occur when a new page is demanded.
Also, albeit probably hardly worth mentioning, the DMT (that's an official HDS term and not my own) is another layer that must be traversed in order to map LBAs in a LUN back to blocks within a Pool for normal read and write requests that don't require allocation of a new page.
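To make those steps a bit more concrete, here is a toy Python model of a BOW operation. To be loud about it: this is my own guess at the shape of the thing, not Hitachi's code. The Pool class, its free_pages, dmt and allocated structures, and the idea of handing out pages in simple first-free order are all assumptions for illustration.

```python
from collections import deque

PAGE_MB = 42  # allocation unit described in this post


class Pool:
    """Toy model of a pool; every structure here is hypothetical."""

    def __init__(self, total_pages):
        self.free_pages = deque(range(total_pages))  # stand-in free page table
        self.dmt = {}        # (volume, virtual page index) -> pool page
        self.allocated = {}  # volume -> pool pages it owns
        self.bow_count = 0   # number of BOW operations performed

    def write(self, volume, offset_mb):
        """Return the pool page backing offset_mb, allocating on first touch."""
        key = (volume, offset_mb // PAGE_MB)
        if key in self.dmt:
            return self.dmt[key]  # page already mapped: no BOW needed
        if not self.free_pages:
            raise RuntimeError("pool has no free pages")
        page = self.free_pages.popleft()   # find next free page, update free table
        self.dmt[key] = page               # update the mapping table
        self.allocated.setdefault(volume, []).append(page)  # volume's page table
        self.bow_count += 1
        return page                        # blocks are now addressable


pool = Pool(total_pages=100)
for offset_mb in (0, 10, 41, 42, 100):
    pool.write("vol0", offset_mb)
print(pool.bow_count)  # 3: offsets 0, 10 and 41 all land in the first page
```

Five writes, three BOW operations. The first three writes all fall inside the same 42MB page, so only the first of them pays the allocation overhead.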
Anyway, if page size was smaller, the DMT would of necessity be larger due to there being more pages per Pool. As a result it would take longer to search and update. And when you think that each pool can have millions of pages, the DMT could get quite large. Take the following as an illustration -
A USP-V, in all its glory, can have over 140 internal Array Groups and each Array Group can be comprised of 8 x 300GB spindles. Each 300GB Array Group formatted to RAID5 (7+1) yields around 1.8358TB usable Base 2. This gives each Array Group 45,832 x 42MB pages. Multiplying this by, let's say, 140 Array Groups gives you 6,416,480 x 42MB pages that all need to be represented in the DMT. And that's not to mention External Storage which can also be Pooled for HDP. So a smaller Page size would significantly increase the size, and reduce the efficiency, of the DMT.
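The sums above are easy to reproduce, and it is worth seeing what happens to the page count as the page shrinks. In the Python sketch below the 1.8358TB and 140 Array Group figures come from this post; the 4MB and 1MB page sizes are purely hypothetical comparison points of my own and are not anybody's real page size.

```python
USABLE_TB_PER_GROUP = 1.8358  # usable TB (base 2) per RAID5 7+1 Array Group
ARRAY_GROUPS = 140            # the "let's say" figure used in this post
MB_PER_TB = 1024 * 1024


def pages_per_group(page_mb):
    """Whole pages of page_mb that fit in one Array Group."""
    return int(USABLE_TB_PER_GROUP * MB_PER_TB // page_mb)


def pages_total(page_mb):
    """Pages across all Array Groups, i.e. entries the DMT must track."""
    return pages_per_group(page_mb) * ARRAY_GROUPS


print(pages_per_group(42))  # 45832
print(pages_total(42))      # 6416480

# Hypothetical smaller page sizes, for comparison only.
for page_mb in (42, 4, 1):
    print(f"{page_mb:>2}MB page: {pages_total(page_mb):>12,} pages to track")
```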
For those of you still reading, thanks, and I'll leave the theory there for now. In theory the ingredients are fine, but the proof is in the pudding -
At the end of the day, from experience in the field and from knowing a little of how well aligned it is to the internal structures and workings of the array, I think the 42MB page works very well.
And even after all is said and done, I trust that the Hitachi guys in Japan know far more than I do about how their kit works, and for that matter I'm sure they also know far more about their own kit than their competitors do.
Now on a side note, there are people out there, their names tend to be "Barry", who like to point their fingers and laugh at Hitachi's supposedly large and chunky 42MB page. One of these Barrys, when questioned on the extent/page size chosen by his own company's Dynamic Provisioning offering, was shamefully quiet. I say shamefully considering his previous deafening criticisms of others' choices as well as a promise to tell us once his engineers had decided. If I remember right, the most we ever got from him was some spiel about it being aligned with the internal workings of their array. Makes one wonder if he has something to hide or some backtracking to do?
It may be that the reason Hitachi chose such a large page size (and size is a matter of perspective) was how flexible the USP-V is and how many different configuration options there are, all of which need to be considered and mapped to.
It works and it works pretty well, that's all I can say for it.
PS. I'm open to anybody shedding any further light on the topic at hand, feel free to comment.