[BBLISA] Backing up sparse files ... VM's and TrueCrypt ... etc

Dean Anderson dean at av8.com
Mon Feb 22 10:59:12 EST 2010


> .         50G sparse file VMWare virtual disk, contains Windows XP
> installation, 22G used.
> 
> .         Back it up once.  22G go across the network.  It takes 30 mins.
> 
> .         Boot into XP, change a 1K file, shutdown.  Including random
> registry changes and system event logs and other random changes, imagine
> that a total of twenty 1k blocks have changed.
> 
> .         Now do an incremental backup.  Sure, you may need to scan the file
> looking for which blocks changed, but you can do that as fast as you can
> read the whole file once, assuming you kept some sort of checksums from the
> previous time.  And then just send 20k across the net.  This should complete
> at least 5x faster than before ... which means at most 6 mins.

But no /backup/ technology that I know of does that today. A checksum on
the whole file won't tell you which /block/ changed.  One would need
checksums on /each/ block, and I don't know of any backup system that
does that. The checksum table alone would be a significant fraction of
the filesystem, or, if the file is sparse, a significant fraction of the
data. Let's say you have a 1K block you want to monitor for changes, and
you use a 160-byte MAC so a changed block can't end up with the same sum
as the old one. See the problem?  Not to mention the cost of computing
the checksums during the backup and looking them up in a database, which
has its own overhead.  The backup system could become the major load on
the server.
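
Just to illustrate what such a per-block scheme would have to do, here's
a rough sketch (my own toy code, not any existing backup tool; it uses a
20-byte HMAC-SHA1 digest rather than the 160-byte MAC above, but the
idea is the same):

# Hypothetical per-block change detector.  Hashes every 1K block and
# compares against the checksum table saved by the previous run
# (which would have been written out with pickle.dump).
import hmac, hashlib, pickle

BLOCK = 1024          # 1K blocks, as in the example above
KEY = b"backup-key"   # made-up MAC key

def block_sums(path):
    """Return {block_number: digest} for every block in the file."""
    sums = {}
    with open(path, "rb") as f:
        n = 0
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            sums[n] = hmac.new(KEY, data, hashlib.sha1).digest()
            n += 1
    return sums

def changed_blocks(path, old_table_path):
    """Compare current sums to the saved table; return changed block numbers."""
    with open(old_table_path, "rb") as f:
        old = pickle.load(f)
    new = block_sums(path)
    return [n for n, d in new.items() if old.get(n) != d]

Note that even this still reads the entire 50G file every time, and
keeps ~20 bytes of state per 1K block on top of it.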

Of course, a versioning filesystem doesn't do it that way.  It just
keeps pointers to a copy-on-write set of blocks that have changed,
rather like virtual memory, starting with the root inode (actually
inodes, plural).  One only needs to compare the two root inodes to find
which blocks have changed between them. At the risk of gross
over-simplification: lather, rinse, repeat for the rest of the inodes in
the filesystem.  You get the point.
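
Very roughly, and only as an illustration of the idea (the structures
below are made up, not any real filesystem's on-disk format), the
comparison is a walk of two block-pointer trees that descends only where
the pointers differ:

# Toy comparison of two copy-on-write snapshots.  Each node maps a child
# index to either a block address (leaf) or another node (indirect block).
# Because blocks are copy-on-write, an unchanged subtree has the *same*
# pointer in both snapshots and can be skipped without reading it.
def changed_leaves(old, new, path=()):
    """Yield paths of leaf blocks whose addresses differ between snapshots."""
    for key in set(old) | set(new):
        a, b = old.get(key), new.get(key)
        if a == b:
            continue                      # identical pointer: subtree unchanged
        if isinstance(a, dict) and isinstance(b, dict):
            yield from changed_leaves(a, b, path + (key,))
        else:
            yield path + (key,)           # changed, added, or removed block

# Example: only one 1K block of the big file was rewritten.
snap1 = {"file.vmdk": {0: "blk#100", 1: "blk#101", 2: "blk#102"}}
snap2 = {"file.vmdk": {0: "blk#100", 1: "blk#777", 2: "blk#102"}}
print(list(changed_leaves(snap1, snap2)))   # [('file.vmdk', 1)]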

It would indeed be _nice_ if only the 20k that changed were sent, but
there aren't many filesystems that /can/ indicate anything more than
"the file changed since the last time it was backed up". Hence the
backup program has to read (and later restore) the entire file.
"Incremental backup" refers to the whole filesystem, not to the blocks
within files.
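
That limitation is why a classic incremental pass looks roughly like
this (a sketch of the usual mtime-based approach, not any particular
tool):

# File-level incremental: anything modified since the last backup gets
# re-read and re-sent *in full* -- even a 50G virtual disk in which only
# 20k actually changed.
import os

def files_to_back_up(root, last_backup_time):
    """Yield paths of files modified since the previous backup."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_time:
                yield path      # the whole file goes into the incremental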

> .  If you do this with tar or dump ... even with compression ... still
> 22G goes across the net.  Another 30 minute backup.
> 
> Is it clear now?

Indeed.  To do what you want (send only the 20k that changed), one needs
a versioning filesystem, like AFS or NetApp's filesystem.  What you want
to do is intimately tied to the filesystem's ability to track which
blocks have changed. AFS, for example, keeps one version back as the
'backup fileset', and an AFS incremental backup takes only the blocks
that differ from the backup fileset. NetApp keeps 10 versions, but I
don't remember how the NetApp backup works. There are efforts in AFS to
allow more versions.  I don't know of any other filesystems that keep
version information; ordinary filesystems (like FFS, ext2/3, NTFS) don't
keep track of which blocks have changed since the last backup.

But your point should be well taken by FS implementors: we need
versioning filesystems.


		--Dean

-- 
Av8 Internet   Prepared to pay a premium for better service?
www.av8.net         faster, more reliable, better service
617 256 5494



