[BBLISA] Backing up sparse files ... VM's and TrueCrypt ... etc

John P. Rouillard rouilj at cs.umb.edu
Mon Feb 22 14:04:30 EST 2010


In message <Pine.LNX.4.44.1002221026440.1779-100000 at citation2.av8.net>,
Dean Anderson writes:
>
>> .         50G sparse file VMWare virtual disk, contains Windows XP
>> installation, 22G used.
>> 
>> .         Back it up once.  22G go across the network.  It takes 30 mins.
>> 
>> .         Boot into XP, change a 1K file, shutdown.  Including random
>> registry changes and system event logs and other random changes, imagine
>> that a total of twenty 1k blocks have changed.
>> 
>> .         Now do an incremental backup.  Sure, you may need to scan
>> the file looking for which blocks changed, but you can do that as fast
>> as you can read the whole file once, assuming you kept some sort of
>> checksums from the previous time.  And then just send 20k across the
>> net.  This should complete at least 5x faster than before ... which
>> means at most 6 mins.
>
>But there is no /backup/ technology to do that now, that I know of. A
>checksum on the whole file won't tell you what /block/ changed.  One
>would need checksums on /each/ block.  I don't know of any backup system
>that does that. The backup log would be a significant fraction of the
>filesystem, or if sparse, a significant fraction of the data. Let's say
>you have a 1k block and you want to monitor for changes, and use a
>160byte mac to ensure no collisions on changes having the same sum. See
>the problem?  Not to mention the issue of computing the checksums during
>backup, looking them up in a database, which has its own overhead.  The
>backup system could become the major load on the server.

Well, rsync-based backup systems (e.g. BackupPC) do that using rsync's
block-comparison algorithm. The copy on the server is read to calculate
per-block checksums, and only the blocks that differ between the
server's copy and the backup client's copy get sent to the server.
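
For what it's worth, the idea looks roughly like the sketch below
(Python, written purely for illustration; the fixed block size, function
names and token format are mine, not rsync's or BackupPC's, and real
rsync slides a cheap rolling weak checksum along the file instead of
hashing at every offset as this toy does):

import hashlib

BLOCK = 4096  # illustrative fixed block size; real rsync picks one per file

def block_signatures(old_data):
    """Receiver side: strong checksum for each block of its old copy."""
    sigs = {}
    for i in range(0, len(old_data), BLOCK):
        sigs[hashlib.md5(old_data[i:i + BLOCK]).hexdigest()] = i // BLOCK
    return sigs

def delta(new_data, sigs):
    """Sender side: emit ('copy', block_index) or ('literal', bytes) tokens."""
    out, lit, pos = [], bytearray(), 0
    while pos < len(new_data):
        chunk = new_data[pos:pos + BLOCK]
        idx = sigs.get(hashlib.md5(chunk).hexdigest())
        if idx is not None and len(chunk) == BLOCK:
            if lit:
                out.append(('literal', bytes(lit)))
                lit = bytearray()
            out.append(('copy', idx))        # receiver already has this block
            pos += BLOCK
        else:
            # Real rsync advances with a rolling weak checksum and only
            # computes the strong checksum on candidate matches; hashing
            # at every byte offset here just keeps the sketch short.
            lit.append(new_data[pos])
            pos += 1
    if lit:
        out.append(('literal', bytes(lit)))
    return out

if __name__ == '__main__':
    old = bytes(1_000_000)                   # "old" copy: 1MB of zeros
    new = bytearray(old)
    new[500_000:500_030] = b'x' * 30         # 30-byte change in the middle
    tokens = delta(bytes(new), block_signatures(old))
    sent = sum(len(t[1]) for t in tokens if t[0] == 'literal')
    print(f"literal bytes sent: {sent} of {len(new)}")

Run against that 1MB file with a 30-byte change in the middle, it
reports only a few hundred literal bytes; everything else goes over as
short block references, which is the same effect I saw on the wire.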

I have seen a significant reduction in network load. However, for large
files the time to do the block comparisons seems to grow non-linearly,
so it may not help as much there, but for files under 1GB my tests
showed it working nicely. When I modified 30 bytes in the middle of a
500MB compressed file, I saw roughly 20k transferred across the wire
(with the rsync and ssh protocol overhead counted in that), so much
less than the 500MB I would have needed for a tar or smb backup.

Now the tradeoff is, as Dean says, high read I/O load on the server
side, with some significant wait times.

One thing to remember is that compression really screws things up. I
had to store some of the files I backed up uncompressed, or else a
small change in the pre-compression data would require backing up
almost the entire compressed file. I am not sure whether bzip2 would do
better than gzip in this regard (it depends on how bzip2 maps its input
blocks to the output).
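
If you want to see the effect for yourself, here is a quick sketch
(plain Python, nothing BackupPC-specific; the test data and the one-bit
change are made up for illustration):

# How far does a one-bit change in the input propagate through gzip
# output? Deflate's bit-packed blocks and adaptive Huffman tables mean
# nearly everything after the change point gets rewritten.
import gzip

plain = b"".join(b"log line %06d: something happened\n" % i for i in range(60_000))
changed = bytearray(plain)
changed[len(plain) // 2] ^= 0x01                 # flip one bit in the middle

a = gzip.compress(plain, 6, mtime=0)             # mtime=0 keeps headers identical
b = gzip.compress(bytes(changed), 6, mtime=0)

first_diff = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
tail_same = 0
while tail_same < min(len(a), len(b)) and a[-1 - tail_same] == b[-1 - tail_same]:
    tail_same += 1
print(f"compressed sizes: {len(a)} and {len(b)}")
print(f"outputs first differ at byte {first_diff}; only {tail_same} trailing bytes match")

On a run of that, almost everything from the change point to the end of
the compressed stream differs, which is why rsync finds nearly nothing
to reuse once it gets past the change.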

I mention that because compression and encryption often have a similar
effect here. Just because you write 1k of data inside a TrueCrypt
container doesn't mean that maps to a single changed block outside it.
Also, depending on the mode of operation, the cipher may chain multiple
blocks together, much as gzip does, with a similar effect on backups.
(Yes, before the flames start, I know the analogy isn't exact, but...)
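
To make the chaining point concrete, here is a toy sketch (the "cipher"
is a single-byte XOR, which is obviously not real cryptography, and I'm
not claiming this is the mode TrueCrypt uses; it just shows what
CBC-style chaining does to everything downstream of a change):

# Toy CBC chaining demo: each ciphertext block is mixed into the next,
# so a one-bit plaintext change alters every ciphertext block after it.
BLOCK = 16
KEY = 0x5A
IV = bytes(BLOCK)

def cbc_encrypt(plaintext):
    prev, out = IV, bytearray()
    for i in range(0, len(plaintext), BLOCK):
        mixed = bytes(p ^ c for p, c in zip(plaintext[i:i + BLOCK], prev))
        ct = bytes(m ^ KEY for m in mixed)   # stand-in for a real block cipher
        out += ct
        prev = ct                            # chaining step
    return bytes(out)

pt = bytearray(1024)                         # 64 blocks of zeros
before = cbc_encrypt(bytes(pt))
pt[100] ^= 0x01                              # flip one bit in block 6
after = cbc_encrypt(bytes(pt))

diff_blocks = {i // BLOCK for i in range(len(before)) if before[i] != after[i]}
print(f"{len(diff_blocks)} of {len(before) // BLOCK} ciphertext blocks changed,"
      f" starting at block {min(diff_blocks)}")

A mode that encrypts each block independently would change only the one
block, and that is exactly the difference between resending 1k and
resending a large chunk of the container.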

--
				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
