Howdy!
Long time no see.
I've come to take the temperature of this place when it comes to network storage.
Back in the day, we overclocked hard- but like many others, once the CPUs got fast enough to keep up, I put my water cooling and whatnot away and just ran stock.
Months turned to years, years turned to a decade- next thing you know I'm packing my watercooling stuff up in a box inside a bigger box, which is quite lost inside a shipping container in the back woods of the state while I figure out what to do with my belongings...
A career with computers is not everything I wanted it to be, but I'm stuck here now... and looking for knowledge on how to solve a problem.
Deduplication, and block replication.
Many of you probably know of these- I'm curious how many are intimately familiar.
On a typical day, I have, say, 100GB of zip files come in. These zip files contain flat text- which compresses really well within a zip. Something like 10:1 in many cases.
All of that data is then unzipped, processed, and re-zipped... then archived. None of this is a problem, but the number of zip files every day is growing- and we process them a few at a time over the network.
As we scale this operation up, a few problems start to arise- namely, unzipping a 1GB file that then becomes 10GB of flat text takes time and space. We can save the space by unzipping into an NTFS-compressed folder, but then we're decompressing just to recompress in a different format- and the re-zip has to decompress all over again before compressing, which adds a lot of CPU overhead and slows the whole operation down.
What I'd like to do is offload that stress to a SAN. Yes, they cost tens of thousands of dollars- turns out, it doesn't matter.
Most SANs these days have a write cache- flash or even RAM to buffer incoming writes. That data is then deduplicated onto disk: the array calculates hashes for all of the blocks sitting in the write cache, then looks for identical hashes in the data it has already stored. Any matching hash means a matching block, and instead of writing that block out again it just records a pointer, so that when the data is read back, that block comes from where it already lives instead of from alongside the rest of the blocks.
Block-level deduplication of uncompressed raw text SHOULD be pretty impressive- but I don't have the tools to get an idea of the actual dedup ratio. If we stuck to 4K blocks, I would imagine the factor would land somewhere around 40:1, given a large enough archive of data to deduplicate against.
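In case it helps anyone poke holes in my estimate: below is the kind of quick-and-dirty tool I'd use to ballpark that ratio against a folder of already-unzipped sample data. It just hashes fixed 4K blocks and counts how many are unique. The block size, hash choice, and directory layout are all my assumptions- a real array will dedupe on its own block size, alignment, and inline compression, so treat this as a rough sketch, not a vendor number.

# Rough estimate of fixed-block dedup ratio across a directory of sample files.
# Assumes 4K blocks aligned from the start of each file; real arrays differ
# (alignment, variable block sizes, inline compression), so this is a ballpark only.
import hashlib
import os
import sys

BLOCK_SIZE = 4096  # 4K blocks, matching the assumption above

def estimate_dedup_ratio(root):
    seen = set()        # hashes of blocks we've already "stored"
    total_blocks = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    if not block:
                        break
                    total_blocks += 1
                    seen.add(hashlib.sha256(block).digest())
    return total_blocks, len(seen)

if __name__ == "__main__":
    total, unique = estimate_dedup_ratio(sys.argv[1])
    print(f"total blocks:  {total}")
    print(f"unique blocks: {unique}")
    if unique:
        print(f"estimated dedup ratio: {total / unique:.1f}:1")

Point it at a directory of unzipped text and it'll tell you how many 4K blocks collapse into duplicates- crude, but it would at least tell me whether 40:1 is fantasy or not.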
Deduplication at that level makes replication pretty easy- if you bring in 100GB of zip files averaging a sad 3:1 ratio, that unzips to 300GB of flat text landing in that write buffer. De-duped at 40:1, that's 7.5GB of data to sync- instead of the 100GB you'd ship zipped.
Still a lot to send over the wire, but it's feasible now.
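For anyone who wants to swap in their own ratios, the back-of-the-envelope math above boils down to this (every value here is an assumption of mine, not a measurement):

# Plug-your-own-numbers version of the replication math above.
zipped_gb = 100     # daily intake of zip files
zip_ratio = 3       # assumed zip compression ratio (a sad 3:1)
dedup_ratio = 40    # assumed block-dedup ratio on the flat text

flat_text_gb = zipped_gb * zip_ratio         # 300GB of unzipped text hits the write buffer
replicated_gb = flat_text_gb / dedup_ratio   # 7.5GB of unique blocks actually goes over the wire

print(f"{flat_text_gb}GB of flat text -> {replicated_gb}GB to replicate "
      f"(vs {zipped_gb}GB if you shipped the zips)")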
Of course, those are my numbers- and one of them is an estimate. What I'm looking for is real-world experience with flat text and deduplication. I'm tempted to call a vendor and try to borrow a SAN, but vendor salespeople are really hard to talk to.