Sunday, 9 November 2014

Compatibility: the challenge for digital archiving

'5 1/4 floppy disk' by Rae Allen
under a CC license
Today I spent a good couple of hours migrating some 15-year-old e-mails of mine from a legacy e-mail client to Thunderbird. It wasn't a difficult process, but it took a bit of research to figure out the steps needed and to try a couple of suggested alternatives. Last week I had a (shorter) adventure getting text out of Word 2.0 and WordPerfect documents. Maybe 15 or 20 years is too long a time for the digital world, but that won't stop me from mentioning - again - the challenges of forward/backward compatibility in media and formats.


Anyone who has been using a computer for a fair amount of time is probably aware of the advice to back up their data. They may not actually be following it, or may not even know how to do it, but they are very likely to have heard it.

There are plenty of reasons to back up one's data. The main one, of course, is protection against data loss due to:
  • hardware failure (e.g., hard drive damage)
  • disaster of any kind
  • user error (e.g., file deleted + trash can emptied + free space wiped, file overwritten, etc.)
  • malicious act (e.g., file destroyed by malware of any kind), etc.
For the enterprise environment, backup is (supposed to be) a must. In certain countries, the backup of specific corporate data is mandated by law. Regardless of that, corporate backup tends to be more comprehensive, maintaining data versions, multiple copies, distribution of copies across different media and locations, ideally both on-site and off-site, etc.

Corporations that depend on their data or need to keep a digital archive inevitably have dedicated infrastructure and people to take care of their backup needs.

Individuals, though, normally have much less. Yes, there is plenty of backup software, both free and commercial. Also, most OSes have some kind of built-in backup and restore utility. However, neither their user-friendliness nor their compatibility across different platforms, or even across major OS versions, is guaranteed.

Even if a user chooses to stick to the same backup solution (which could be something as simple as a plain file copy from one disk to another), there is the challenge of the medium's suitability and durability. Anyone who has been using a PC for more than 10 years is likely to have used floppy disks and/or ZIP drives and/or CDs and/or DVDs and/or external hard drives and/or flash drives for their temporary or long-term backup. The problem is that some of these media are not readily supported by a modern PC, e.g., modern PCs have neither 5¼-inch drives to read the old floppies, nor parallel ports to support the original ZIP drives.

In order to be on the safe side, a user keen on archiving should, from time to time, migrate data from one medium to another. This is a very tedious task, especially if a large number of storage media are involved, but let's assume that it is reasonably feasible.
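As an illustration of what such a migration could look like when it is scripted rather than done by hand, here is a minimal Python sketch. It assumes both the old and the new medium are mounted as ordinary directories. The function names (sha256_of, migrate_tree) are made up for this example. It copies every file and then compares SHA-256 checksums, so a bad copy from an ageing disk is noticed right away instead of years later.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_tree(source: Path, target: Path) -> list[Path]:
    """Copy every file under source into target, keeping the folder layout.

    Returns the source files whose copy did not match the original.
    An empty list means every copy was verified.
    """
    mismatches = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        dst_file = target / src_file.relative_to(source)
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 also keeps timestamps
        if sha256_of(src_file) != sha256_of(dst_file):
            mismatches.append(src_file)
    return mismatches
```

The target folder should not sit inside the source folder. A call such as migrate_tree(Path("/media/old_disk"), Path("/media/new_disk")) (both paths hypothetical) would return the list of files that need a second look.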

The ultimate challenge is compatibility across file formats and program versions. Common formats that adhere to widespread standards are normally in the clear. Image files, for instance, such as JPEG or GIF or BMP have a long history, so files created decades ago will be displayed by virtually all modern software. The opposite doesn't necessarily hold, i.e., files in newer versions of a format may not be displayable by legacy software. When it comes to formats for files not-so-frequently exchanged, however, compatibility may be an issue. Take e-mail files, for instance. Different e-mail clients tend to store e-mail in different structures. Nowadays, when e-mail clients often ship with the OS, things tend to be clearer, though a few years ago there was considerably higher fragmentation (e.g., different formats for Eudora, Netscape/Unix, Outlook Express, Outlook, Pegasus Mail, etc.). In fact, today, a large portion of our e-mail stays in the cloud, which sort of solves the compatibility problem, although it introduces a different set of challenges.
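To make the "different structures" point concrete, here is a small sketch using Python's standard mailbox module, which can read and write a handful of the classic on-disk layouts. It converts an mbox file (one big file holding many messages) into a Maildir (one file per message). The function name and the paths are made up for this example. Whether a given legacy client's store really is plain mbox is an assumption to check first, as several old clients used their own variants.

```python
import mailbox


def mbox_to_maildir(mbox_path: str, maildir_path: str) -> int:
    """Copy every message from an mbox file into a Maildir folder.

    Returns the number of messages copied. The mbox file itself is only
    read, never modified.
    """
    source = mailbox.mbox(mbox_path, create=False)  # fail if the file is missing
    target = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in source:
            target.add(message)  # the message is converted to Maildir form
            copied += 1
        target.flush()
    finally:
        source.close()
        target.close()
    return copied
```

For instance, mbox_to_maildir("Inbox", "Inbox.maildir") (hypothetical file names) would leave the original file in place and report how many messages made it across, which is a handy number to compare against what the old client shows.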

Is there a bottom line to this? Well, not really. If one needs to have data from the past, one needs to either maintain legacy hardware and software (which may or may not be possible) or put in the effort to migrate the data to newer formats and media. It sounds deceptively simple, doesn't it?

(The following video is a talk by Chad Fowler at a Scala Days conference regarding 'Legacy' in software development - it is a long, not well-lit, but interesting presentation.)
