BArch Design Notes


Problem: when doing listed incrementals, tar first scans the entire filesystem and memorizes the list of files, and only then writes them to the archive. Files can change between the scan and the write, which leads to inconsistencies and warnings.

Solution: barch writes out the directory listings _after_ the files within the directory. This allows the directory memorization to be done in a single pass at the same time as actually archiving those files.
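
The ordering described above can be sketched as a post-order walk. This is only an illustration, not barch's actual code: the function name archive_order and the tuple layout are made up for the example, and symbolic links to directories are treated as plain entries.

```python
import os

def archive_order(root):
    """Yield entries in the order described above (hypothetical layout).

    Each non-directory entry is yielded as ("file", path, None); the
    directory itself is yielded last, as ("dir", path, names), once
    everything inside it has been emitted.
    """
    names = sorted(os.listdir(root))
    for name in names:
        path = os.path.join(root, name)
        if os.path.isdir(path) and not os.path.islink(path):
            # Recurse first, so the subdirectory's listing follows its contents.
            yield from archive_order(path)
        else:
            yield ("file", path, None)
    # The listing is written only after the contents, in the same pass.
    yield ("dir", root, names)
```

Because the listing is produced in the same pass that visits the files, no separate scan of the filesystem is needed before archiving starts.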


Problem: tar has problems with restoring directories that originally had no write bits set.

Solution: since barch writes out each directory listing after the directory's contents, the mode of a directory is set only after its entire contents have been restored.
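
A minimal sketch of this restore order follows, assuming a hypothetical in-memory record list in archive order (files first, then their directory). The record layout and the name restore are assumptions for illustration only.

```python
import os

def restore(records, dest):
    """Restore records of the hypothetical form (kind, relpath, payload).

    For kind "file" the payload is the file's bytes; for kind "dir" the
    payload is the directory's final mode. Directory records are assumed
    to arrive after the records for everything inside that directory.
    """
    for kind, relpath, payload in records:
        path = os.path.join(dest, relpath)
        if kind == "file":
            # Create the parent with an owner-writable mode for now.
            os.makedirs(os.path.dirname(path), mode=0o700, exist_ok=True)
            with open(path, "wb") as f:
                f.write(payload)
        elif kind == "dir":
            os.makedirs(path, mode=0o700, exist_ok=True)
            # Only now, with the contents in place, apply the real mode,
            # which may have no write bits at all.
            os.chmod(path, payload)
```

With this ordering a directory recorded as mode 0555 never has to be written into after its mode has been applied.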


Problem: when backing up changed files, cpio, tar, or zip will re-archive the entire file, even if only small parts of it have changed.

Solution: barch will (eventually) use an algorithm similar to that of rsync to encode only the differences between the previously known data and the current data. The first implementation of this algorithm will only do a common prefix scan. The checksum values will be stored in a CDB table along with the directory listings. To prevent this table from becoming huge, the checksum block size will be set large, at 4K or 8K (rsync prefers an approximately 500 byte block size), with 20 bytes per checksum (a 32-bit fast checksum plus a 128-bit MD5 hash).
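
As a rough sketch of the common prefix scan with 20-byte block checksums: the code below uses zlib's Adler-32 as a stand-in for the 32 bit fast checksum, which is an assumption and not necessarily what barch or rsync uses. The function names and the big-endian packing are also made up for the example, and the CDB storage is left out.

```python
import hashlib
import struct
import zlib

BLOCK_SIZE = 4096  # the 4K option mentioned above

def block_checksums(data, block_size=BLOCK_SIZE):
    """Yield one 20-byte checksum per block: 4 bytes fast + 16 bytes MD5."""
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        fast = zlib.adler32(block) & 0xFFFFFFFF  # stand-in fast checksum
        yield struct.pack(">I", fast) + hashlib.md5(block).digest()

def common_prefix_blocks(old_sums, new_data, block_size=BLOCK_SIZE):
    """Count the leading blocks of new_data whose checksums match old_sums."""
    matched = 0
    for old, new in zip(old_sums, block_checksums(new_data, block_size)):
        if old != new:
            break
        matched += 1
    return matched
```

Only the data after the matched blocks would then need to be written to the archive. At 20 bytes per 4K block, the checksums for a 1 GB file take about 5 MB.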


Problem: tar fails to properly archive files that change length. If the file shrinks, it gets padded with NUL bytes, and if it expands, the extra data is ignored.

Solution: barch writes records to the archive in length-prefixed chunks. This solution also accommodates record data for which the length is not known when the prefix is written.
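
One possible shape for such chunking is sketched below. The 4-byte big-endian length prefix and the zero-length terminating chunk are assumptions chosen for the example, not barch's actual on-disk format.

```python
import struct

def write_record(src, out, chunk_size=65536):
    """Copy src to out as length-prefixed chunks, ending with a zero-length chunk.

    The total length is never written up front, so it does not matter if
    the source shrinks or grows while it is being read.
    """
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out.write(struct.pack(">I", len(chunk)))
        out.write(chunk)
    out.write(struct.pack(">I", 0))  # terminator: end of this record

def read_record(inp):
    """Read chunks until the zero-length terminator and return the joined data."""
    parts = []
    while True:
        header = inp.read(4)
        if len(header) != 4:
            raise EOFError("truncated chunk header")
        (length,) = struct.unpack(">I", header)
        if length == 0:
            return b"".join(parts)
        data = inp.read(length)
        if len(data) != length:
            raise EOFError("truncated chunk")
        parts.append(data)
```

The reader learns where the record ends from the terminator rather than from a size recorded before the data was read, so no NUL padding or truncation is ever needed.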


Issue: should filenames be written to the archive as (directory ID, file name) pairs, or as full filenames?

The first suggestion makes the archive metadata smaller by shrinking the file name information. It also makes it much simpler to decide whether each member falls within a path listed for extraction.

The second suggestion is best for reliability -- if a directory entry is lost, we can't get out of sync with the filesystem.

Resolution: use full filenames.


When extracting files, barch by default writes all data to temporary files, which are then renamed into place. This prevents the files from being either inconsistent or corrupted while data is being extracted.
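
The write-then-rename approach might look roughly like the following. The function name extract_atomically and the temporary file prefix are hypothetical. The temporary file is created in the destination directory so that the rename stays on one filesystem.

```python
import os
import tempfile

def extract_atomically(path, chunks):
    """Write chunks to a temporary file next to path, then rename it into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".barch-tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
        # Replaces any existing file in a single step.
        os.replace(tmp, path)
    except BaseException:
        # Extraction failed: drop the partial file, leave the old one alone.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

If extraction is interrupted, any existing file at the destination keeps its old contents, and only the temporary file is lost.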


I feel that tar does too much that could be better achieved through external programs. Reblocking, for example, would be better handled by an external program such as dd or multibuf (which also does streaming), since such a program can offer many more options than tar does. The reblocking facility will be explicitly omitted from barch, and possibly other features as well.