A better method of doing backups?

Following on from my posts about putting the hard disk in the freezer (and various other people saying they have lost data too), I think there must be a better way of handling backups.

The problem, as I see it, is that it is hard (or space-consuming) to back up everything. Some people (eg, myself) use rsync to make "incremental" backups, so that only files that have changed are copied to the backup disk. But even then, you have these issues:

  • This doesn't handle recovering from changed files. Say you accidentally delete most of the middle of a file. You don't notice for a day (or a week), and in the meantime the good backup you had is replaced by the "bad" file.
  • You are probably backing up the same file many times. For example, you make a copy of the Arduino download tree "just in case" before you change something. Now you have two virtually identical copies of everything.
  • If you literally copy your original files, any files you have deleted are also deleted from the copy. And if you choose to only ever add files and never remove them, then the copy directory is no longer a literal copy of the original.

What I would like is to have a single copy of each file, even if the file happens to occur many times on my hard disk.

Reference:

Also Google: "deduplication backup software".

There seem to be a few (quite a few?) commercial products out there that do something like this, but I would rather trust my backups to something open source, rather than relying on some proprietary product that may or may not be around, or work, or still have a working licence, when I need it.

My general plan would be to do this (a rough sketch in code follows the list):

  • From a specified starting point (eg. the root directory, or your home directory) "walk" the directory tree recursively.
  • For each file encountered, take the hash of its contents (eg, using SHA, MD5 or similar)
  • Assume that two files with the same hash are the same file; then we only need to copy a file if a copy with that hash does not already exist.
  • Check our "backup" disk/directory/server for an existing file of that name (the hash), eg. 9E70BE30221CE738537B82EB3EDC46D4.backup
  • If the file doesn't exist, we assume it was never backed up (in its current form) so we copy from the source file to 9E70BE30221CE738537B82EB3EDC46D4.backup (or whatever) in the backup directory.
  • If that file does exist, we skip the copy - it is already backed up.
  • We need to be able to get our directory structure back when doing a restore, so for each file (whether or not it needed to be copied) we add an entry to a "directory structure" database, for example a SQLite3 database. This entry records: the original file name, the owning directory, the hash, the size, the date/time modified, attributes (eg. read-only, hidden) and permissions (eg. owner/group).
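To make that concrete, here is a minimal sketch of the backup pass in Python. It is not a finished tool: the paths, the SQLite table layout and the choice of SHA-256 are all just assumptions for illustration.

#!/usr/bin/env python3
# Sketch of the backup pass described above: walk a tree, hash each file,
# copy it to the backup area only if that hash has not been seen before,
# and record the metadata needed for a later restore in an SQLite database.
# Paths, table name and column names are illustrative only.

import hashlib, os, shutil, sqlite3, stat

SOURCE_DIR = "/home/nick"          # specified starting point to walk
BACKUP_DIR = "/mnt/backup/store"   # where the <hash>.backup files live
DB_PATH    = "/mnt/backup/catalogue.db"

def file_hash(path, chunk=1024 * 1024):
    """Return the hex digest of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                return h.hexdigest()
            h.update(block)

os.makedirs(BACKUP_DIR, exist_ok=True)
db = sqlite3.connect(DB_PATH)
db.execute("""CREATE TABLE IF NOT EXISTS files
              (name TEXT, directory TEXT, hash TEXT, size INTEGER,
               mtime REAL, mode INTEGER, uid INTEGER, gid INTEGER)""")

for dirpath, dirnames, filenames in os.walk(SOURCE_DIR):
    for name in filenames:
        src = os.path.join(dirpath, name)
        st = os.lstat(src)
        if not stat.S_ISREG(st.st_mode):
            continue                      # skip symlinks, devices, etc.
        digest = file_hash(src)
        dest = os.path.join(BACKUP_DIR, digest + ".backup")
        if not os.path.exists(dest):      # same hash => assumed already backed up
            shutil.copy2(src, dest)
        db.execute("INSERT INTO files VALUES (?,?,?,?,?,?,?,?)",
                   (name, dirpath, digest, st.st_size, st.st_mtime,
                    st.st_mode, st.st_uid, st.st_gid))

db.commit()
db.close()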

If a restore is required, we simply consult the "directory structure" database to work out which files need to be copied back. The backed-up file (eg. 9E70BE30221CE738537B82EB3EDC46D4.backup) is located on the backup disk, and copied back under its original name. Its permissions etc. are set back to what they were originally.
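The restore side, again only a sketch against the same made-up catalogue schema, would look something like this: look up each recorded file, copy the <hash>.backup blob back to its original path, and reapply the recorded metadata.

# Sketch of the restore described above, using the same illustrative schema:
# copy each <hash>.backup file back under its original name and put the
# recorded modification time, permissions and ownership back.

import os, shutil, sqlite3

BACKUP_DIR = "/mnt/backup/store"
DB_PATH    = "/mnt/backup/catalogue.db"
RESTORE_TO = "/home/nick"          # restore everything recorded under this path

db = sqlite3.connect(DB_PATH)
rows = db.execute("""SELECT name, directory, hash, mtime, mode, uid, gid
                     FROM files WHERE directory LIKE ?""", (RESTORE_TO + "%",))

for name, directory, digest, mtime, mode, uid, gid in rows:
    target = os.path.join(directory, name)
    os.makedirs(directory, exist_ok=True)
    shutil.copyfile(os.path.join(BACKUP_DIR, digest + ".backup"), target)
    os.utime(target, (mtime, mtime))   # put the modification time back
    os.chmod(target, mode)             # and the permissions
    os.chown(target, uid, gid)         # and the owner/group (needs root; Unix only)

db.close()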

This system would let you back up multiple computers onto a single backup disk. Any files which you happened to share (eg. movies, photos, music) would not be backed up twice as they would hash to the same thing.

The only real overhead would be the database used to recreate the directory structure. Still, you could clean old databases up from time to time, and to the extent that you keep them, you can always get your file system back "as at" a previous date. I suspect that is roughly how the Mac's Time Machine system works.

So, has anyone already done this, does anyone know? I mean as open source, not as a $1000 product with a $100 annual maintenance fee. Per PC.

Well, my method may leave a lot to be desired, but it has served me well for almost 5 years now. Having a Windows XP system, I've always felt very vulnerable with all the malware, faulty programs, and other bad things that can happen to a Windows system, and because I was also using my only PC for some home-based independent contractor work, I felt I needed something bulletproof.

So weekly I would back up any work-related MS Word files I had created to an external USB drive (they were also saved in the client's web-based directory folder once created), and once a month I would create a mirror-image clone of my system disk (500 GB) to an internal Seagate 500 GB drive that I had mounted in a removable cartridge drive set-up that was normally powered off.

Seagate has a free disk utility (which only works if the destination drive is a Seagate model) that includes this cloning function; it has worked very well and has saved my butt at least 2 or 3 times now. This also allows me to be a little braver than I might otherwise be when trying out new software and other 'risky' trials or experiments, as I am not very good at troubleshooting and repairing Windows problems. Anyway, this method of cloning the complete drive is pretty much foolproof for this fool's machine and habits. It only takes about 20 mins or so.

Lefty

What I've been using to back up my systems (all running Ubuntu) is "Simple Backup System" - and I set it for "logarithmic" purging; I'm not sure this is necessarily what you want or need, but it has worked well for me. If you use a versioning system for your code like SVN or CVS, it should be all you need (well, that and a place to back your files up, of course).

One thing I don't like about the hashing scheme is that you wind up with files that are hard to retrieve without the backup infrastructure.

I way prefer methods that either just save files as is or at most ZIP them into a single archive.

I do the following...

  1. During the day I run a program every 5 minutes (automatically); it looks at my working folders and does an incremental backup of changed files. Every few days I clear out the older archives.

  2. Every evening I run a sync prog that syncs my "work" directory with an equivalent on an external drive. This only copies one way and never deletes so the external version has everything I've ever done. If a folder gets renamed it is essentially duplicated on the external drive.

  3. Every now and then I repeat the sync process onto an offsite external drive.

  4. I never back up any system files, programs etc. I figure that if I need to rebuild because of a virus I wouldn't trust it anyway. This was a pain the other day because I had to reload all my programs and lost default settings etc., but it's a good excuse to stop the program rot that sets in after a year or two.

  5. As far as possible I never let programs use those stupid "My xxx" folders; everything is in a "work" folder on the D drive.

The only problem I have with the above is that the sync process doesn't save older versions; once you sync, you've lost the past. I should probably change that to a full backup plus daily incrementals.
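For what it's worth, a one-way sync that keeps the past doesn't have to be complicated. This is not the program described above - just a sketch in Python, with made-up paths, of a sync that never deletes and that moves the previous version of any changed file into a dated folder before overwriting it:

# Sketch only: one-way sync from a "work" folder to an external drive that
# never deletes anything, and that tucks the old version of a changed file
# into a dated "versions" folder instead of silently overwriting it.
# All paths are made up for illustration.

import filecmp, os, shutil, time

WORK     = "D:/work"                     # source tree
EXTERNAL = "E:/work"                     # mirror on the external drive
VERSIONS = "E:/work-versions/" + time.strftime("%Y-%m-%d")

for dirpath, dirnames, filenames in os.walk(WORK):
    rel = os.path.relpath(dirpath, WORK)
    dest_dir = os.path.join(EXTERNAL, rel)
    os.makedirs(dest_dir, exist_ok=True)
    for name in filenames:
        src = os.path.join(dirpath, name)
        dst = os.path.join(dest_dir, name)
        if os.path.exists(dst):
            if filecmp.cmp(src, dst, shallow=False):
                continue                           # unchanged, nothing to do
            old_dir = os.path.join(VERSIONS, rel)  # keep the old copy first
            os.makedirs(old_dir, exist_ok=True)
            shutil.move(dst, os.path.join(old_dir, name))
        shutil.copy2(src, dst)                     # copy new or changed file
        # nothing is ever deleted from EXTERNAL, as in step 2 above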


Rob

I let the cloud take the strain and have all my documents in Dropbox.

They just get synced in the background automatically, and obviously it's off-site.

As far as roll-back type backups go, I also let Apple Time Machine do its magic, so that when I screw something up I can revert.

At one point I figured that all the backup methods are either complicated, error-prone or otherwise not really satisfying. Now I have settled on buying each drive three times. Every once in a while I do a backup of everything with dd. The new backup goes into a fireproof box behind a fireproof door (in a different part of the building). The old backup comes back and sits close to the original until I consider it time for a new backup.

Very important data gets specific backups on USB sticks and/or web storage.

The whole idea is that in case of a crash I will recover the OS basically immediately and recover the "important files" from the short-term copies. Everything else will be lost. IMHO hard drives are cheaper than the potential effort of restoring data.

Did I mention that I used to use more sophisticated backup approaches, which failed when I needed them? Then I had to restore the files from the "unrestorable backup". Hence I will now stick to binary backups. There is lots of literature on different approaches, but this one has worked pretty well for me so far.

Nick, have you had a look at Dirvish?

I've been using it for years; I used it to do data backups at one of my last jobs. It uses rsync to do delta backups and hard-links all files that stayed the same. You can specify expiry rules... It is also quite simple to add md5 or sha checksums for each file by adding a post-backup script yourself (one line).

I run this backup daily on my web server. The whole hard-linking business has the advantage that each backup instance shows the full content to the user, and it is also possible to copy the whole backup folder to another disk with rsync (its -H option preserves the hard-link structure).

The master config file looks something like this:

bank:
        /Backup

image-default:  %Y-%m-%d--%H-%M-%S
log:    gzip
index: gzip

exclude:
        *.tmp
        *.aux
        */Cache/*
        */cache/*
        */tmp/*
        */Tmp/*
        */Temp/*
        */temp/*
        */Cookies/*
        */cookies/*

expire-rule:
        wday {mo-sa} +7 days
        wday {su} +4 weeks
        mday {1} +6 months

expire-default: +14 days

rsync-option: 
        -a 
        --acls
        # --bwlimit=16384
        --numeric-ids

Runall:
        etc
        home
        var_mail
        var_www

The 'dirvish' and 'dirvish-expire' scripts are simply started by cron. Inside the 'BANK' folder reside sub-directories with the individual backups you need:

/Backup/----+---folder_1------+---2012-01-01--14-00-00------+---dirvish_md5sums.log.gz
            |                 |                             |---index.gz
            |                 |                             |---log.gz
            |                 |                             |---summary
            |                 |                             |---tree/
            |                 |---dirvish/------+---default.conf
            |                 |                 |---default.hist
            |                 |---init.sh
            |                 |---incremental.sh
            |                 |---md5_check.sh
            |                 |---md5_check_today.sh
            |
            |---folder_2
            .
            .
            .

The default.conf for each backup-job may look like this:

client: fully.qualified.host.name.com
tree: /var/lib/mysql
xdev: 0
index: gzip
#pre-server:
pre-client: /var/lib/mysql_db_dumps/dump_databases.sh;
#post-client:
post-server: cd ..; find ./tree -type f ! -name "dirvish_md5sums.log" -exec md5sum {} \; |gzip > dirvish_md5sums.log.gz

and the history file:

#IMAGE  CREATED REFERECE        EXPIRES
2012-01-26--17-30-25    2012-01-26 17:30:28     default +7 days == 2012-02-02 17:30:25
2012-01-27--06-32-51    2012-01-27 06:33:26     2012-01-26--17-30-25    +7 days == 2012-02-03 06:32:51
2012-01-28--04-37-31    2012-01-28 04:39:03     2012-01-27--06-32-51    +7 days == 2012-02-04 04:37:31
2012-01-29--04-26-46    2012-01-29 04:27:21     2012-01-28--04-37-31    +4 weeks == 2012-02-26 04:26:46
2012-01-30--04-21-12    2012-01-30 04:21:54     2012-01-29--04-26-46    +7 days == 2012-02-06 04:21:12
2012-01-31--04-30-41    2012-01-31 04:32:16     2012-01-30--04-21-12    +7 days == 2012-02-07 04:30:41
2012-02-01--04-40-08    2012-02-01 04:41:38     2012-01-31--04-30-41    +6 months == 2012-08-01 04:40:08

Of course this does NOT provide any bare-metal recovery at all, but it allows you to run backups daily without wasting a whole lot of space on duplication.