Feb 13 2007

The Busy Writer: Backups

Published by matt at 13:16

This is the second in a series of posts where Tom Colvin and I explore how The Busy Writer can make their process more robust in the face of uncertain technology.

Scenario

In this series of posts, I’m going to take Tom’s situation as a case study (unless some other writers chime in with scenarios they feel are substantially different):

For several years, I’ve been researching, and now finally writing, a rather huge, little-known story about a scientific/medical expedition sponsored by King Carlos IV of Spain. While I’ve been writing professionally all my life, I’ve never attempted anything of this scale before.

And Tom’s specific question regarding data backup:

How to be sure I’ve backed up all my research and writing, which, without vigilance, gets scattered all over my hard disk and into some online repositories.

Here, I’ll explore the philosophy of a good backup strategy, a series of increasingly robust solutions to the challenge of backing up digital data, and leave some teasers for things that are actually tricky to archive that I might come back to in a later post.

Backups are critical

In reply to my last post, perhaps the most critical question Tom asked about the writing process in the digital age has to do with backups. Your computer is no more reliable than your backup strategy. If you have no backup strategy, then your data is toast when the computer is toast. Crying, at that point, is a good strategy.

I don’t know exactly what kind of data lives on a writer’s computer, but I can guess. There are documents in a variety of formats, be it Word, or plain text, or AppleWorks, or Pages, or Framemaker, or Pagemaker, or InDesign… any of a host of possible applications generated the content that lives on The Author’s computer. Furthermore, there are webpages and PDFs that are saved all over to support that writing process, all references and notes of one sort or another.

What makes for good backups?

The first rule is they MUST be automatic. Why? Because you cannot forget to do automatic things: the computer remembers for you. If, every day at 9PM, your computer backs up all of your critical data to a Magic Archive (a magical computer Somewhere Else where your data is safe for all time), you can go to bed early and sleep soundly knowing that you aren’t in danger of losing any work from that day. (Somewhere Else is a magical place that is removed from the worries and dangers of the world as we know it.)

Second, they MUST be off-site. If your house burns down, and your backups are in your closet… you didn’t achieve anything. Tornados, hurricanes, floods, locusts… anything that can destroy your computer can destroy your backups as well (if they are in the same place). Therefore, having a backup of your data that lives in the same room as the computer is only a partial solution, and doesn’t really represent a good backup strategy.

Third, good backups MUST be redundant. A single backup is unsafe for a variety of reasons, and therefore redundancy is one of your best strategies for ensuring the recoverability of digital data. Let’s pretend you have your computer and one backup, perhaps a CDR or DVDR in a safe-deposit box in a bank Somewhere Else. Although Somewhere Else is technically safe from the dangers of the world, your DVDR might have been faulty from the start. Little did you know that your data was never actually backed up properly: there are errors on the DVDR that you created. So, when your computer crashes, and you send for your DVDR, you can’t actually restore.

Finally, a good backup strategy MUST be tested. This means two things. First, you should test the integrity of the data after it is backed up. Second (and perhaps more importantly), you should know that you can recover your workspace from the backups that exist, and in a reasonable amount of time. For example, you could print all of your data to paper as zeros and ones, and your recovery strategy would be to type those zeros and ones back into the computer. This would be fairly robust, if you used good paper, with good inks, and stored the paper under ideal conditions. However, recovery would take decades.
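To make the "test the integrity" half of that concrete, here is a minimal sketch in Python. It assumes the backup is an ordinary folder that mirrors the original (for example, on a mounted drive); the function names are made up for this illustration and do not come from any particular backup tool. It compares a SHA-256 checksum of every original file against its copy and reports anything missing or different.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(source: Path, backup: Path) -> list[str]:
    """List files under `source` that are missing or different in `backup`."""
    problems = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source)
        copy = backup / relative
        if not copy.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(original) != sha256_of(copy):
            problems.append(f"differs: {relative}")
    return problems
```

An empty list means every file in the original folder has an identical twin in the backup. It says nothing about how long a full restore would take, which is the other half of testing.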

Strategies: From quick-and-dirty to somewhat-robust

In light of these requirements for safe and reliable backups, I think there are several ways for The Author to proceed with backing up their data. I’ve used many of them at one point or another, and will try and highlight my frustrations with each as I go. I’ll also discuss the requirements in light of each different strategy, because it is often the case that automatic, off-site, redundant, and tested backups are difficult for us wee human beings to achieve on our own.

CDR / DVDR

Tom can back his data up to CDRs and DVDRs. They’re cheap, they’re easy to obtain, and there are lots of potential problems with them as a storage medium.

First, they’re not automatic. You need to load a disc into your computer and then tell it to burn a backup of all your data. Second, it is difficult to move the discs off-site. You can burn multiple discs to make things redundant, but that takes effort, and testing your strategy is difficult at best.

Part of Tom’s problem in using CDRs and DVDRs is this statement regarding how his data “… without vigilance, gets scattered all over my hard disk and into some online repositories.” If your data truly gets scattered everywhere, then there is only one acceptable backup solution: back up everything. If you cannot be sure that you didn’t save things in a weird, random place, then you must back up your entire computer, always, to be sure that you didn’t miss anything that is important. In this regard, we cannot use DVDRs; although they hold roughly 5GB of data, a typical computer hard drive now is 80GB or more. Even if the computer only had a 40GB drive, it would still require 8 DVDs to back it up entirely, or 16 DVDs to back it up redundantly.
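The disc arithmetic is easy to check for your own drive. This is a small sketch in Python; the function name is made up, and the 5GB default is the round per-disc figure used in this post, not a precise capacity.

```python
import math


def discs_needed(data_gb: float, disc_gb: float = 5, copies: int = 1) -> int:
    """Number of discs needed to hold `copies` full copies of the data."""
    # Round up: a partly filled disc still counts as a whole disc.
    return math.ceil(data_gb / disc_gb) * copies
```

With these assumptions, `discs_needed(40)` gives 8 and `discs_needed(40, copies=2)` gives 16, matching the numbers above; a full 80GB drive would need 16 discs for a single copy.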

If we assume a little bit of vigilance, things get better. For example, Tom could do some organization of his data, creating a system that might look like this:

writing /
    20060315 /
        Article /
            Files for Article ...
    20070101 /
        Book /
            Files for Book ...
research /
    20070214 /
        Stuff found and saved on Valentine’s Day 2007 ...

First, he could create a series of dated folders, where each project ends up in a folder that is dated with the day it was started. Likewise, the research supporting a project is dated as well. This might not work in practice, exactly, as the supporting research for many projects will get scattered throughout all time. However, not all research is specific to one project… therefore, there is no one, good taxonomy for all of this supporting data. I’ll come back to this later.

At the least, though, if all of the writing and supporting material is under one or two folders on your computer, then you know that backing up the “writing” folder and the “research” folder captures everything you can’t live without. This is the minimum level of vigilance that you really need to get to. Now, things like bookmarks in a browser are problematic… but I’ll include those in due time.
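Once everything lives under those two folders, capturing it all takes very little code. The following is a rough sketch in Python, assuming folders named "writing" and "research" as in the layout above; the function name and the "backup-YYYYMMDD.zip" naming are my own choices for illustration. It bundles every file under the given folders into a single dated zip archive.

```python
import zipfile
from datetime import date
from pathlib import Path


def archive_folders(folders: list[Path], destination: Path) -> Path:
    """Zip every file under `folders` into one dated archive in `destination`."""
    destination.mkdir(parents=True, exist_ok=True)
    archive = destination / f"backup-{date.today():%Y%m%d}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for folder in folders:
            for path in sorted(folder.rglob("*")):
                if path.is_file():
                    # Store paths relative to the folder's parent, so the
                    # archive keeps the "writing/..." and "research/..." layout.
                    bundle.write(path, path.relative_to(folder.parent).as_posix())
    return archive
```

The destination should sit outside the folders being archived (an external drive, say), otherwise the archive would try to include itself.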

Typical CDR/DVDR Backups

Criterion    Assessment
Automatic    No
Off-site     No
Redundant    With Effort
Recovery     Likely tedious
Cost         Drive ($20-$50) + $0.05/GB

(Cost based on ukdvdr.co.uk, 20070215)

Experience Report

I have a lot of backups on CDR/DVDR. These are hard to search and hard to manage. One of my projects for the next few months is to go through them, one at a time, and copy them onto my laptop. Then, I’ll synchronize that data to an online archive. Then, I’ll delete the data, and throw away the disc. The next time I want to “own” a copy of all of it, in one place, I’ll buy a hard drive (or whatever technology we have then), and copy it all off of the online storage site.

Regardless, these are hard to manage in the long run.

An extra hard drive

It is possible to set up automatic backups with an extra hard drive. You can buy one for cheap (compared to the cost of rewriting everything on your computer, from scratch, and from memory), plug it in, and have a second drive that is as large (or larger) than the one in your computer to begin with.

So, using software like that described at free-backup.info (for Windows) or at pure-mac.com/backup (for Mac), you could set up a nightly job that copies your entire internal hard drive to your external hard drive. You now have automatic backups on a second drive, and can be pretty sure that you can easily restore from it, which gives you reasonable recovery even if you never rehearse a failure of the main drive. However, when someone breaks in, they’ll take that external drive as well as the computer itself, so that doesn’t help you with the problem of having off-site backups.
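If you would rather see what such a nightly job boils down to, here is a minimal sketch in Python. It is not what any of those backup programs actually do; the function name and the dated-folder layout are assumptions for illustration, and the scheduling itself (cron, the Windows Task Scheduler, and so on) is left to the operating system.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_to_drive(source: Path, drive: Path) -> Path:
    """Copy `source` into a folder on `drive` named after today's date."""
    target = drive / date.today().strftime("%Y%m%d") / source.name
    # dirs_exist_ok (Python 3.8+) lets a second run on the same day
    # overwrite the earlier snapshot instead of raising an error.
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

Each run produces a complete copy under a new date, so the external drive fills up quickly with large folders; real backup software usually prunes old snapshots or copies only what changed.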

Experience Report

Either way, an extra hard drive is a good step. If the main drive dies, you’re not out-of-luck. And, recovering should be straight-forward: get a new computer, plug in the external drive, and drag-and-drop stuff from your last backup. I say this from experience; when my Powerbook was stolen, I was fortunate to have a full disk image made with Carbon Copy Cloner from just a few weeks before. This, combined with some other tools that I use, meant that I lost no data that I could think of. It was still tedious to have lost the Powerbook, and I am still in the process of sifting through the external drive for things that I might want several months later… but it was some protection. Keep in mind, though, that the external drive was in the office, and the theft happened at home, meaning that it was technically an off-site backup.

Extra Hard Drive

Criterion    Assessment
Automatic    Yes
Off-site     No
Redundant    No
Recovery     Manual, straight-forward
Cost         $0.35/GB

(Cost based on pricewatch.com, 20070215)

Network Attached Storage (NAS)

Here’s one that’s growing in popularity: a special little box that you plug into your home network that just contains hard drives. In fact, some of these things are even wireless, and you just copy data to them.

NAS units are largely the same as the single, external hard drive, with one exception: you can get some redundancy from a good NAS box. You see, an external hard drive typically has one disk in it; a NAS unit can have two drives (or more), and those drives can be set up as mirror images of each other. This is great, because (in theory), if one of the drives in the NAS unit fails, you can go buy another, insert it into the box, and it will automatically recopy everything onto the new drive, leaving you with two drives (in one box) that have two copies of your data.

Obviously, you still have the unit in your house (not off-site), but you can stick it somewhere other than your desk, meaning that it is less likely to be found by thieves when they break in to steal things. And it is somewhat redundant; while everything is in one little box, you do have your data on two drives instead of one. This provides some protection, but trust me: there are still ways for both drives to fail at once, killing your backup solution.

Amazon.co.uk has a whole section of their store dedicated to network attached storage. For example, the Buffalo TeraStation is one unit that allows multiple drives to be grouped together and turned into mirror images of each other. There are, though, many others, and I am not about to go into all of the things you should concern yourself with when purchasing a NAS unit for your own use (not right here, right now anyway). For example, here’s one story of how a NAS might be used, including some off-site rotation. However, this requires commitment.

NAS (mirrored)

Criterion    Assessment
Automatic    Yes
Off-site     No
Redundant    Yes*
Recovery     Manual, straight-forward
Cost         $1.25/GB

(Cost based on Amazon.com, 20070215)

* By “redundant”, I mean “on more than one hard drive.” Obviously, it isn’t in more than one location in the world, so you can still lose everything in a house fire or tornado.

Online Backup: Bingo! and Amazon S3

There are lots of ways to back your data up online. Too many require too much complexity on the part of The Author. If a backup solution is going to work, it needs to be simple and straight-forward.

I see two viable ways of doing online backup right now: Bingo! and Amazon S3.

Bingo!

A solution using Bingo! would look like the following:

  1. Mount a network drive
  2. Use a backup program (Windows, Mac) to do an automatic backup of one or more directories on a nightly basis
  3. Sleep easier

This is a reasonably good solution; it costs a flat rate per year, and is certainly more robust than anything you can do yourself, in your house. That is, Bingo is using the Sun Fire X4500 series of data servers; this is basically a big NAS device, but it costs far more than you can afford; put another way, with 24 TB (where 1 TB = 1000 GB) of storage, an X4500 costs somewhere between $30,000 and $48,000. And to think, it doesn’t even have a 0-60 MPH figure you can quote to your friends at the bar…

So, the Bingo folks have spent big cash on good hardware, and are selling disk space. You can rent that space from them, and although it isn’t duplicated in more than one place around the world, it is far more reliable than any hard drive you can purchase for your home. And, the cost to you is cheap: $50/year for 25 GB of storage. This is probably more space than Tom needs to back up his critical documents and notes, but I could be wrong.

Using the space does look simple: you would right-click on “My Computer”, say “Map Network Drive,” and then enter the details you get from the Bingo folks. Likewise, on the Mac, you would do “Apple-K” from the Finder, and then enter the information. Then, if you use an automatic backup program, it can copy things to that networked drive just like there was a second drive sitting on your desktop. That’s the point, of course: using the networked drive should be that easy.

Both backup and recovery are slow compared to a local hard drive; it must copy all your data over the Internet. Depending on your connection speed, this could take days the first time you do a backup. However, once you’ve done an initial backup, most backup programs will only copy things that change. This kind of incremental backup is quick, and often you’ll have only changed a handful of backed up documents in a given day, meaning that the end-of-day backup will only take a few minutes, even over the network.
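For the curious, the idea behind an incremental backup fits in a few lines. This is a simplified sketch in Python, not the algorithm of any specific backup program: it assumes the networked drive is mounted as an ordinary folder, and it decides whether a file has changed by comparing size and modification time only. The function names are made up for this example.

```python
import shutil
from pathlib import Path


def needs_copy(original: Path, copy: Path) -> bool:
    """True when the copy is missing, a different size, or older."""
    if not copy.is_file():
        return True
    src, dst = original.stat(), copy.stat()
    return src.st_size != dst.st_size or src.st_mtime > dst.st_mtime


def incremental_backup(source: Path, backup: Path) -> list[str]:
    """Copy only new or changed files; return the relative paths copied."""
    copied = []
    for original in sorted(source.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(source)
        copy = backup / relative
        if needs_copy(original, copy):
            copy.parent.mkdir(parents=True, exist_ok=True)
            # copy2 also preserves the modification time, so an unchanged
            # file is not copied again on the next run.
            shutil.copy2(original, copy)
            copied.append(relative.as_posix())
    return copied
```

The first run copies everything; later runs return only the handful of files touched since. Some network drives store timestamps with coarser precision, in which case a real tool needs a more forgiving comparison than this one.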

Bingo!

Criterion    Assessment
Automatic    Yes
Off-site     Yes
Redundant    Yes*
Recovery     Manual, straight-forward
Cost         $0.50/GB (min $50/year)

(Cost based on bingodisk.com, 20070215)

* Although your data isn’t stored redundantly around the world, it is stored on more than one HD in a professional data center, which is far better than the NAS solution you might buy and put under your desk.

Amazon S3 + Jungle Disk

Another way to do online backup is with Jungle Disk. I’ve written about this previously. You can follow those instructions to get Amazon S3 set up. And, you can use it with automatic backup software just like Bingo!. The difference is that you only pay for what you use with Amazon S3, whereas you pay for a big chunk of space on Bingo, and get charged for it whether there is data in it or not.

For example, let’s say you only have 5GB worth of data; that means a 25GB space from Bingo! is 5x bigger than you need (at the moment, anyway). This will cost you $50/year, flat rate. Another way to do this is to use Amazon S3, which costs $0.15 per GB per month. Or, put another way, your 5GB of data will cost $0.75 per month, or $9.00 over the course of a year. However, unlike Bingo!, Amazon charges you when you copy the data to or from their servers at a rate of $0.20 per GB. Assuming you copy your data once (to their server), it will cost you a total of $10.00 over the course of a year to store 5 gigabytes of data on Amazon’s servers. Of course, if I copy all of that data back to my computer on a regular basis, it will cost me $1.00 each time. (Really, the transfer costs for a typical backup scenario are absolutely negligible.)
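If you want to run this comparison for your own data, here is a small Python sketch that multiplies out the rates quoted above ($0.15 per GB per month for storage, $0.20 per GB transferred, and Bingo's flat $50 per year). The constant and function names are mine, and the calculation assumes the amount of data stays the same for all twelve months. Amounts are kept in whole cents to avoid rounding surprises.

```python
STORAGE_CENTS_PER_GB_MONTH = 15   # $0.15 per GB per month, as quoted above
TRANSFER_CENTS_PER_GB = 20        # $0.20 per GB copied to or from S3
BINGO_FLAT_CENTS_PER_YEAR = 5000  # $50 per year for 25 GB


def s3_first_year_cents(data_gb: int, uploads: int = 1, restores: int = 0) -> int:
    """First-year S3 cost in cents: 12 months of storage plus transfers."""
    storage = data_gb * STORAGE_CENTS_PER_GB_MONTH * 12
    transfer = data_gb * TRANSFER_CENTS_PER_GB * (uploads + restores)
    return storage + transfer


def dollars(cents: int) -> str:
    """Format a whole number of cents as a dollar amount."""
    return f"${cents / 100:.2f}"
```

For 5GB uploaded once, `dollars(s3_first_year_cents(5))` works out to $10.00 against Bingo's flat `dollars(BINGO_FLAT_CENTS_PER_YEAR)` of $50.00. At these rates the two cross over at 25GB, where S3 also comes to $50.00 for the first year.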

Using JungleDisk, you can get a mountable drive (like Bingo!), and copy things to it. You can even use automatic backup software to synchronize data to the servers, meaning you can do incremental backups to Amazon’s servers. This cuts down on your costs—you don’t want to constantly copy all of your data to their servers every day. But what I really like is that Amazon’s solution copies your data to multiple hard drives in multiple places around the planet. At least, they say they do, and I suspect they’re not lying.

This is true redundancy.

And, the important thing for me is that JungleDisk now has a little backup utility built into it, so I can specify a few folders, and say “Backup Now”. It will automatically synchronize my folders to my Amazon S3 account—and will happily pick up in the middle of a backup if I quit, or shut down, or whatever.

Experience Report

I’ve committed now to using Amazon’s S3 service with JungleDisk, especially since they announced a roadmap and added the backup features. Also, with a likely $20 pricetag, I’m happy to buy the program when it goes “1.0.” The software does what I want, and I feel better knowing that my iPhoto Library, email, and critical documents are all backed up “somewhere else”. I think my mother is even using Amazon S3 now (I set my parents up with it), and that is a Good Thing. I sleep even better knowing that my parents’ machine is backed up, since it is probably my fault (somehow) if it crashes and burns.

Another neat feature of JungleDisk is the encryption. JungleDisk will, upon request, encrypt all of my data before sending it over the net and storing it on Amazon’s servers. This way, my data isn’t (casually) accessible to anyone who manages to intercept the data on its way to Amazon, or to Amazon themselves… however, I’m suspecting that they have better things to do than look at my photo archive.

Jungle Disk + Amazon S3

Criterion    Assessment
Automatic    Yes
Off-site     Yes
Redundant    Yes
Recovery     Manual, straight-forward
Cost         $0.15/GB/month storage, $0.20/GB transferred ($1.80/GB/year storage)

(Cost based on Amazon S3, 20070215)

(The cost model is a bit wacky, but simply put it is a one-time transfer cost and a monthly storage cost.)

Conclusions

A little organization on the desktop makes it much easier to back your data up to a second location; at least being consistent in saving work in one (or possibly two) places is a good start. Even if you don’t, it is still possible to back up many different directories using software on a regular basis, automatically.

Online backup is now cheap enough that it doesn’t make sense to do anything else if you really, really care about your data. If you just want to go spend $100 for an external drive, this is still a good start, but you can pay less, and get better reliability in the long run, by using an online service. Even if it costs you $200/year to back everything up somewhere else, that is worth it if your house burns down or someone breaks in and steals everything. And besides, storing it somewhere else makes it someone else’s problem to manage the maintenance of the hardware, not yours.

Because, after all, you’re a writer, not a system administrator, damnit!

Looking Forward

What next? I might write a little bit about how to manage the explosion of data that comes from the research process. That’s a shorter post, I suspect. We’ll get to version control shortly, though. Or, perhaps Tom or someone else will have a comment or question on the article that will lead to the next post in this “thread”.

Update 20070215, 17:00: I discovered JetS3t, an open-source set of tools for browsing and synchronizing to the Amazon S3 filestore. I’m personally going to switch to these, despite being pleased with JungleDisk at the moment. My primary reason for this is that I have the complete source to the tools that are managing my backups if I use JetS3t. Therefore, I can (confidently) use the encryption features and know that the algorithm by which my data was encrypted is completely known to me… in a language I know, and that runs on all of my machines. (The JungleDisk equivalent is written in C#, and therefore only works under Windows at the moment, as far as I can tell.)

I’ll have to spend more time experimenting with these tools to give a conclusive report on which I think are the way to go. Certainly, I’m pleased with what I’ve seen of the JetS3t tools so far. Once I’ve committed and really know how they all work, I’ll post something on how I chose to do my backups. I’m pretty sure it will involve the Java tools, as they have command-line versions. However, JungleDisk will probably work fine for most uses.

Creative Commons License

This post is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 License.

(I’m reasonably confident that an automated weblog that harvests these posts solely for the purpose of generating advertising revenue could be called commercial purposes. My post yesterday was “harvested” and copied for just this purpose. Grr. And it was copied without attribution. Grr.)

4 Responses to “The Busy Writer: Backups”

  1. Tom Colvin on 15 Feb 2007 at 16:42

    Matt, that’s a very informative post. Amazingly, our approaches are almost identical. There must be a lesson there.

    I’ve just written a description of my own back up practices on my blog, filed under the “Backing Up” category.

    I look forward to continuing this discussion with you as we move through other topics we together identified.

  2. [...] Matthew Jadud at Sub Ubi Blog has responded to my query regarding writers and their computing habits with an in-depth discussion of back-up strategies for writers. [...]

  3. [...] Tom responded to my post on keeping backups; between the two posts, I think there’s a nice combination of information. As he points out, my comments are quite technical. I do my best not to get caught in the details, but it sometimes comes with the profession. It is also interesting to compare the posts: Tom’s is more discursive, while mine was more analytical—focusing on the details and mechanisms by which you could backup your work, and very little on how those mechanisms might fit into a writer’s workflow. [...]

  4. [...] In sequence: 1 2 3 [...]
