Deduplication: You Must Unlearn What You Have Learned

Ah yes, Yoda. I have been using that quote for the last three years to describe the way people think about infrastructure and the way they designed it for their data centers in the past. They need to forget what they pieced together back then and start thinking about how they design their infrastructure for the future.

I think it is now time to turn this conversation to the storage technology of deduplication.

Clarke's Third Law

Any sufficiently advanced technology is indistinguishable from magic.

TL;DR - SimpliVity deduplicates IO, and data efficiency is a result of that IO deduplication.

Deduplication was invented as a method to increase the available capacity of the storage products our data is persisted to. The first commercially viable products utilizing deduplication were backup products. Diligent Technologies Corp. (purchased in 2008 by IBM, where the product lives on as ProtecTIER) released the first deduplicating Virtual Tape Library (VTL) aimed at the large enterprises of the world, while Data Domain (purchased in 2009 by EMC) initially focused on the SMB and mid-market. Both were very successful companies, and by the end of the last decade and the beginning of this one almost every backup platform had added deduplication as a feature.

Earlier this decade we saw primary storage arrays begin to add deduplication as an add-on feature, although the deduplication was ALWAYS done post-process, meaning it happened after the data had already landed on persistent storage. The performance penalty was simply too great to do anything else.

In 2009, a company named SimpliVity saw what was taking place in data centers: people implementing numerous products just to manage all of the copies of data they required. They started thinking about how they could "Simplify IT" by helping customers remove that complexity from their data centers. VMware was taking on Tier 1 applications in many data centers and had become the go-to CPU and memory consolidation tool for x86 infrastructure.

SimpliVity saw the prime opportunity to do the same thing for the data infrastructure components in data centers (primary storage, backup deduplication appliances, WAN accelerators, and backup software), so they decided to take that problem head on and solve it. They researched many open source file systems at the time, including Google FS and ZFS, concluded that none of them would solve the problem they were trying to solve, and decided to invent their own Object Store and File System. I affectionately call our technology the "Data Hypervisor". You'll see why as you read on.

Deduplication of IO, Not Data

At the root of the SimpliVity File System and Object Store (SVTFS) is the ability to eliminate all duplicate writes at inception, without performance degradation. Read that again. Let that sink in. This is the paradigm shift.

SimpliVity acknowledges and deduplicates a write IO (and reads, but writes are what matter most here) at inception, in NVRAM (cache) on multiple nodes, before the write ever hits persistent SSD or HDD. That is what allows everything else SimpliVity does to take place. From there, many other inventions were created and patented to make this possible and to provide predictable performance in the data center.

Predictable performance. Something that SimpliVity gives you with their OmniStack Accelerator. Put simply, this is the ability to provide storage services without utilizing the same processors that your VMs are also trying to use.

SVTFS and the OmniStack Accelerator together are what SimpliVity calls the Data Virtualization Platform (DVP) (check out a great deep dive white paper here: Link). Because the DVP deduplicates writes at inception, the DVP owns the write and the metadata associated with that block of data. From there the DVP manages that block throughout its entire lifecycle. The DVP never duplicates a block of data. Never. Ever.
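To make the idea concrete, here is a minimal Python sketch of inline deduplication (my own illustration, not SimpliVity's code): blocks are fingerprinted at write time, a duplicate write becomes nothing more than a metadata update, and reference counts track how many logical copies point at each block. The class name, SHA-256 fingerprinting, and data structures are illustrative assumptions.

```python
import hashlib

class InlineDedupStore:
    """Toy content-addressed block store illustrating inline deduplication.

    Block size, SHA-256 fingerprinting, and these structures are illustrative
    assumptions, not SimpliVity's actual implementation.
    """

    def __init__(self):
        self.blocks = {}      # fingerprint -> block bytes, stored once and only once
        self.refcounts = {}   # fingerprint -> number of logical references

    def write(self, data: bytes) -> str:
        """Acknowledge a write and dedupe it at inception.

        The fingerprint is computed before anything lands on "persistent"
        storage, so a duplicate write costs only a metadata update.
        """
        fp = hashlib.sha256(data).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = data                      # unique block: persist it once
        self.refcounts[fp] = self.refcounts.get(fp, 0) + 1
        return fp                                       # metadata reference to the block

store = InlineDedupStore()
a = store.write(b"the same 4K block of data")
b = store.write(b"the same 4K block of data")           # duplicate: no new block stored
assert a == b and len(store.blocks) == 1
```

The point of the toy is the ordering: the duplicate is caught before it ever becomes a persisted block, which is exactly the opposite of post-process deduplication.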

Clones

Clone - A point-in-time copy of a VM. The clone is a separate VM.

Now that we know the DVP never duplicates blocks of data, it's easy to see that SimpliVity has essentially removed the need for additional IO when creating copies of a VM. Since the DVP has already stored all of the blocks that make up a VM, a clone is simply a full copy of that VM in metadata (not dependent upon the base VM) that the DVP presents to the hypervisor. That's it. No IO, no additional disk space, nothing but metadata.
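If it helps, here is that idea in the same toy model (again, an illustration under my own assumptions, not the actual DVP): a VM is just an ordered list of block fingerprints, and a clone is a copy of that list plus a bump to each block's reference count. No block data moves.

```python
# A VM's "data" here is just an ordered list of block fingerprints (its metadata).
# Cloning copies that metadata and bumps reference counts; no block is read or
# written. All names are illustrative assumptions.
refcounts = {"fp_a": 1, "fp_b": 1, "fp_c": 1}
base_vm = {"name": "vm01", "blocks": ["fp_a", "fp_b", "fp_c"]}

def clone_vm(vm, new_name):
    new_blocks = list(vm["blocks"])        # metadata-only copy of the VM
    for fp in new_blocks:
        refcounts[fp] += 1                 # each block gains one more owner
    return {"name": new_name, "blocks": new_blocks}

clone = clone_vm(base_vm, "vm01-clone")
assert clone["blocks"] == base_vm["blocks"] and refcounts["fp_a"] == 2
# The clone is not dependent on the base VM: it simply holds its own references.
```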

Backup

Backup - A point-in-time full copy of a VM. The backup is not dependent upon the base VM.

Again, the DVP never duplicates blocks of data. Since a backup is merely another copy of a VM in metadata, the DVP doesn't generate more IO or utilize more physical disk space when running a backup. All SimpliVity backups are full backups. The only difference between a clone and a backup is that the backup isn't presented to the hypervisor. Again, simple.

Note: SimpliVity backups are NOT snapshots. They are independent full metadata copies of a VM without any dependency on the original copy. In fact, you can delete the VM and the backups remain in the Federation until they fulfill their retention period. See this link for a deeper dive: Link
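Continuing the toy model (my assumptions, not SimpliVity internals), this is why deleting the source VM doesn't touch the backup: both are independent metadata copies referencing the same immutable blocks, and a block is only reclaimed when nothing references it anymore.

```python
# Why a backup survives deletion of its source VM: both are independent metadata
# copies referencing the same immutable blocks. Names are illustrative assumptions.
blocks = {"fp_a": b"...", "fp_b": b"..."}
refcounts = {"fp_a": 2, "fp_b": 2}         # referenced by vm01 and by its backup
vm01 = ["fp_a", "fp_b"]
backup_of_vm01 = list(vm01)                # full, independent metadata copy

def delete(metadata_copy):
    for fp in metadata_copy:
        refcounts[fp] -= 1
        if refcounts[fp] == 0:             # reclaim only when nothing references it
            del blocks[fp]

delete(vm01)                               # delete the VM itself
assert all(fp in blocks for fp in backup_of_vm01)   # the backup's blocks remain
```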

Replication

Replication - The act of sending a point-in-time full copy of a VM from one data center to another.

Replication for the DVP is quite simply a full backup of a VM from one site to another. Where this gets interesting is that the DVP is not just local; it is a global function that spans all data center objects. What?! Yep!

The DVP spans all sites within a SimpliVity Federation and understands where every block of data is stored throughout the entire Federation of nodes. Because of this, when we do a full backup of a VM from one site to another, the DVP only sends the unique blocks of data that don't already exist within the DVP at the remote site. It is because of this unique functionality that customers can remove any WAN acceleration device that is only in place for replication traffic.

Analogy - Think of each VM as a Lego set and the metadata for that VM as the instruction booklet that comes with the set. When replication runs, the DVP sends a copy of the VM's instruction booklet to the target site. The DVP at the target already knows which Lego blocks it has, so it tells the source which unique blocks it still needs to make a full logical copy of the VM at the target site. The target then pulls only those unique blocks across the wire and assembles the new point-in-time logical copy of the VM.
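Here is the Lego-booklet analogy as a rough Python sketch. The protocol shape and names are my own illustrative assumptions, not SimpliVity's wire format: the source ships the manifest of fingerprints, the target works out what it's missing, and only those blocks cross the WAN.

```python
# The Lego-booklet analogy in code: ship the manifest, then move only the blocks
# the target doesn't already have. Protocol shape is an illustrative assumption.
source_blocks = {"fp_a": b"A" * 4096, "fp_b": b"B" * 4096, "fp_c": b"C" * 4096}
vm_manifest = ["fp_a", "fp_b", "fp_c"]     # the "instruction booklet" (metadata)

target_blocks = {"fp_a": b"A" * 4096}      # the target site already holds fp_a

def replicate(manifest, source, target):
    missing = [fp for fp in manifest if fp not in target]   # target asks only for these
    for fp in missing:
        target[fp] = source[fp]            # only unique blocks cross the WAN
    return missing

sent = replicate(vm_manifest, source_blocks, target_blocks)
print("blocks sent over the wire:", sent)  # ['fp_b', 'fp_c']; fp_a was already there
```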

Summary

Well, there you have it. SimpliVity deduplicates IO, and data efficiency is just a result of that work. The truth is that most customers think what SimpliVity does is magic. It's not, but it sure seems like it. SimpliVity took the time (43 months) to actually fix a big problem in the data center. Others took open source file systems and rushed a product out the door for VDI.

Lots of folks in the HCI market claim they have deduplication. They do. It's either near-line or post-process. Or maybe even partially inline, for a subset of the data, or only when their controller isn't too busy. No one in the HCI market, except SimpliVity, can claim to deduplicate IO inline for all data, at inception, once and forever, across all tiers of media, globally, and provide predictable performance for that IO. If they do, call them out on it - no one has time for smoke and mirrors.

If you would like more information, a deeper dive, or an actual live demo, please reach out to [me](mailto:ron.singler@simplivity.com?subject=Blog%20Post) or your local SimpliVity partner or sales team.