Howdy!
Long time no see.
I've come to take the temperature of this place when it comes to network storage.
Back in the day, we overclocked hard- but like many others, once the CPUs got fast enough to keep up, I put my water cooling and whatnot away and just ran stock.
Months turned to years, years turned to a decade- next thing you know I'm packing my watercooling stuff up in a box inside a bigger box, which is quite lost inside a shipping container in the back woods of the state while I figure out what to do with my belongings...
A career with computers is not everything I wanted it to be, but I'm stuck here now... and looking for knowledge on how to solve a problem.
Deduplication, and block replication.
Many of you probably know of these- I'm curious how many are intimately familiar.
On a typical day, I have, say, 100GB of zip files come in. These zip files contain flat text- which compresses really well within a zip. Something like 10:1 in many cases.
All of that data is then unzipped, processed, and re-zipped... then archived. None of this is a problem, but the number of zip files every day is growing- and we process them a few at a time over the network.
As we scale this operation up, a few problems start to arise- namely, unzipping a 1GB file that then becomes 10GB of flat text takes time and space. We can save the space by unzipping into an NTFS-compressed folder, but then we're decompressing just to recompress in a different format- and the re-zip has to decompress all over again before compressing, which adds a lot of CPU overhead and slows the whole operation down.
What I'd like to do is offload that stress to a SAN. Yes, they cost tens of thousands of dollars- turns out, it doesn't matter.
Most SANs these days have a write cache- flash or even RAM to buffer incoming writes. That data is then deduplicated onto disk: the array calculates hashes for all of the blocks sitting in the write cache, then looks for identical hashes in the data it has already stored. Any matching hash means a matching block, and instead of writing that block out again it just records a pointer, so that when the data is read back, that block comes from where it already lives instead of from alongside the rest of the blocks.
Block-level deduplication of uncompressed raw text SHOULD be pretty impressive- but I don't have the tools to get an idea of the actual dedup ratio. If we stuck to 4K blocks, I would imagine the factor would land somewhere around 40:1, given a large enough archive of data to deduplicate against.
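In case it helps anyone poke holes in my estimate: below is the kind of quick-and-dirty tool I'd use to ballpark that ratio against a folder of already-unzipped sample data. It just hashes fixed 4K blocks and counts how many are unique. The block size, hash choice, and directory layout are all my assumptions- a real array will dedupe on its own block size, alignment, and inline compression, so treat this as a rough sketch, not a vendor number.

# Rough estimate of fixed-block dedup ratio across a directory of sample files.
# Assumes 4K blocks aligned from the start of each file; real arrays differ
# (alignment, variable block sizes, inline compression), so this is a ballpark only.
import hashlib
import os
import sys

BLOCK_SIZE = 4096  # 4K blocks, matching the assumption above

def estimate_dedup_ratio(root):
    seen = set()        # hashes of blocks we've already "stored"
    total_blocks = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    if not block:
                        break
                    total_blocks += 1
                    seen.add(hashlib.sha256(block).digest())
    return total_blocks, len(seen)

if __name__ == "__main__":
    total, unique = estimate_dedup_ratio(sys.argv[1])
    print(f"total blocks:  {total}")
    print(f"unique blocks: {unique}")
    if unique:
        print(f"estimated dedup ratio: {total / unique:.1f}:1")

Point it at a directory of unzipped text and it'll tell you how many 4K blocks collapse into duplicates- crude, but it would at least tell me whether 40:1 is fantasy or not.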
Deduplication at that level makes replication pretty easy- if you bring in 100GB of zip files averaging a sad 3:1 ratio, that unzips to 300GB of flat text landing in that write buffer. De-duped at 40:1, that's 7.5GB of data to sync- instead of the 100GB you'd ship zipped.
Still a lot to send over the wire, but it's feasible now.
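For anyone who wants to swap in their own ratios, the back-of-the-envelope math above boils down to this (every value here is an assumption of mine, not a measurement):

# Plug-your-own-numbers version of the replication math above.
zipped_gb = 100     # daily intake of zip files
zip_ratio = 3       # assumed zip compression ratio (a sad 3:1)
dedup_ratio = 40    # assumed block-dedup ratio on the flat text

flat_text_gb = zipped_gb * zip_ratio         # 300GB of unzipped text hits the write buffer
replicated_gb = flat_text_gb / dedup_ratio   # 7.5GB of unique blocks actually goes over the wire

print(f"{flat_text_gb}GB of flat text -> {replicated_gb}GB to replicate "
      f"(vs {zipped_gb}GB if you shipped the zips)")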
Of course, those are my numbers- and one of them is an estimate. What I'm looking for is real-world experience with flat text and deduplication. I'm tempted to call a vendor and try to borrow a SAN, but vendor salespeople are really hard to talk to.