Following on from my posts about putting the hard disk in the freezer (and various other people saying they have lost data too) I think there must be a better way of handling backups.
The problem, as I see it, is that it is hard (or space-consuming) to back up everything. Some people (eg. myself) use rsync to make "incremental" backups, so that only files that have changed are copied to the backup disk. But even then, you have these issues:
- This doesn't handle recovering from unwanted changes to files. Say you accidentally delete most of the middle of a file: you don't notice for a day (or a week), and in the meantime the good backup you had is replaced by the "bad" file.
- You are probably backing up the same file many times. For example, you make a copy of the Arduino download tree "just in case" before you change something. Now you have two virtually identical copies of everything.
- If you make a literal copy of your original files, any files you delete are also deleted from the copy. And if you choose to add new files but never remove deleted ones, then the copy directory is no longer a literal copy of the original.
What I would like is to have a single copy of each file, even if the file happens to occur many times on my hard disk.
Reference:
Also Google: "deduplication backup software".
There seem to be a few (quite a few?) commercial products out there that do something like this, but I would rather trust my backups to something open source than rely on some proprietary product that may or may not still be around, or still work, or still have a working licence, when I need it.
My general plan would be to do this (a rough Python sketch follows the list):
- From a specified starting point (eg. the root directory, or your home directory) "walk" the directory tree recursively.
- For each file encountered, take a hash of its contents (eg. using SHA-256, MD5 or similar).
- Assuming that two files with the same hash are the same file, we only need to copy files whose hashes we have not seen before (that is, if a copy does not already exist).
- Check our "backup" disk/directory/server for an existing file of that name (the hash), eg. 9E70BE30221CE738537B82EB3EDC46D4.backup
- If the file doesn't exist, we assume it was never backed up (in its current form) so we copy from the source file to 9E70BE30221CE738537B82EB3EDC46D4.backup (or whatever) in the backup directory.
- If that file does exist, we skip the copy - it is already backed up.
- We need to be able to get our directory structure back when doing a restore, so for each file (whether or not it needed to be copied) we add an entry to a "directory structure" database, for example an SQLite3 database. This entry records: the original file name, the owning directory, the hash, the size, the date/time modified, attributes (eg. read-only, hidden) and permissions (eg. owner/group).
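Here is a minimal sketch of that backup pass in Python, just to show the shape of the idea. The paths, the ".backup" file-naming convention and the table layout are my own assumptions, not a finished tool: there is no error handling, symlinks are not treated specially, and permissions are stored in a simplified form.

```python
# Sketch of the backup pass: walk the tree, hash each file, copy content
# we haven't seen before, and record every file in a catalog database.

import hashlib
import os
import shutil
import sqlite3
import stat

SOURCE_ROOT = "/home/me"                  # assumed starting point
BACKUP_DIR = "/mnt/backup/store"          # where the hash-named copies live
DB_PATH = "/mnt/backup/catalog.sqlite3"   # the "directory structure" database

def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def backup(source_root, backup_dir, db_path):
    os.makedirs(backup_dir, exist_ok=True)
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files (
                      path      TEXT,    -- original full path
                      directory TEXT,    -- owning directory
                      hash      TEXT,    -- content hash (name of the .backup file)
                      size      INTEGER,
                      mtime     REAL,    -- modification time
                      mode      INTEGER, -- permission bits
                      uid       INTEGER,
                      gid       INTEGER)""")

    for dirpath, dirnames, filenames in os.walk(source_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if not os.path.isfile(full_path):     # skip sockets, broken links etc.
                continue
            digest = file_hash(full_path)
            dest = os.path.join(backup_dir, digest + ".backup")
            if not os.path.exists(dest):          # first time we've seen this content
                shutil.copyfile(full_path, dest)
            st = os.stat(full_path)
            db.execute("INSERT INTO files VALUES (?,?,?,?,?,?,?,?)",
                       (full_path, dirpath, digest, st.st_size,
                        st.st_mtime, stat.S_IMODE(st.st_mode),
                        st.st_uid, st.st_gid))
    db.commit()
    db.close()

if __name__ == "__main__":
    backup(SOURCE_ROOT, BACKUP_DIR, DB_PATH)
```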
If a restore is required, we simply consult the "directory structure" database to work out which files need to be copied back. The backed-up file (eg. 9E70BE30221CE738537B82EB3EDC46D4.backup) is located on the backup disk, and copied back under its original name. Its permissions etc. are set back to what they were originally.
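A matching restore might look something like the sketch below, assuming the catalog schema from the backup sketch above. It recreates every file recorded in the database under a new root; ownership can only be restored if you run it with sufficient privileges.

```python
# Sketch of the restore pass: for each catalog entry, copy the hash-named
# backup file back to its original name and reapply permissions and times.

import os
import shutil
import sqlite3

def restore(backup_dir, db_path, restore_root):
    db = sqlite3.connect(db_path)
    rows = db.execute("SELECT path, hash, mtime, mode, uid, gid FROM files")
    for path, digest, mtime, mode, uid, gid in rows:
        source = os.path.join(backup_dir, digest + ".backup")
        target = os.path.join(restore_root, path.lstrip(os.sep))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(source, target)      # bring the content back
        os.chmod(target, mode)               # original permissions
        os.utime(target, (mtime, mtime))     # original modification time
        try:
            os.chown(target, uid, gid)       # original owner/group (needs root)
        except (PermissionError, AttributeError):
            pass                             # not permitted, or not on POSIX
    db.close()

if __name__ == "__main__":
    restore("/mnt/backup/store", "/mnt/backup/catalog.sqlite3", "/tmp/restored")
```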
This system would let you back up multiple computers onto a single backup disk. Any files which you happened to share (eg. movies, photos, music) would not be backed up twice as they would hash to the same thing.
The only real overhead would be the databases used to recreate the directory structure. Still, you could clean old ones up from time to time. And to the extent that you keep them, you can always get your file system back "as at" a previous date. I suspect that is roughly how the Mac's Time Machine system works.
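One way to get that "as at a previous date" behaviour would be to keep one catalog per backup run (eg. catalog-2025-01-31.sqlite3) and, at restore time, pick the newest catalog that is not later than the date you want. The file-name pattern here is my own assumption, not part of the plan above.

```python
# Pick the catalog for an "as at" restore: newest catalog dated on or
# before the requested date (ISO dates compare correctly as strings).

import glob
import os
import re

def catalog_as_at(catalog_dir, wanted_date):
    """Return the newest catalog dated on or before wanted_date
    (a string like '2025-01-31'), or None if there is none."""
    candidates = []
    for path in glob.glob(os.path.join(catalog_dir, "catalog-*.sqlite3")):
        m = re.search(r"catalog-(\d{4}-\d{2}-\d{2})\.sqlite3$", path)
        if m and m.group(1) <= wanted_date:
            candidates.append((m.group(1), path))
    return max(candidates)[1] if candidates else None

print(catalog_as_at("/mnt/backup", "2025-01-31"))
```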
So, has anyone already done this, does anyone know? I mean as open source, not as a $1000 product with a $100 annual maintenance fee. Per PC.