Rethinking Personal Digital Archiving Part 2: Implications for Services, Applications, and Institutions

D-Lib Magazine
March/April 2008

Volume 14 Number 3/4

ISSN 1082-9873

Catherine C. Marshall
Microsoft Research, Silicon Valley
<cathymar@microsoft.com>

1. Introduction

In Part 1 of this article, I laid out a space of challenges that we must overcome to ensure that we retain our digital assets over time and through changes in computing platforms and digital technologies. When they are set out this way, none of these challenges are surprising. Yet taken together, they suggest a radical revision in the way we approach personal digital archiving, and the types of services, applications, and institutions we put in motion at its behest.

For example, instead of looking at this problem as an impetus to centralize and unify, our data suggest that we conceive of personal archives as fundamentally distributed and unified primarily through a metadata store.¹ Furthermore, the data call into question assumptions about access – that desktop search will be sufficient to meet our needs over the long haul – and about keeping – that we will want to in some way keep everything we've laid our hands on over the years.²

More than overturning assumptions, what these challenges are intended to do is to refocus where we expend our energies as developers and technologists and to expand the discussion to include new services, applications, and institutions. While it is important to be able to decode, render, and interact with a lifetime's worth of digital objects, and crucial to develop repositories that are trusted, robust storage for these digital objects, it is not enough to stop there.

In this portion of the article, I explore the implications of the four challenges presented in Part 1 – (1) accumulation, (2) distribution, (3) digital stewardship, and (4) long-term access – and discuss (at least in a preliminary, superficial way) some promising technological directions and requirements for each. In essence, we need to arrive at answers for some very basic questions:

What should we keep?
Where should we put it?
How should we maintain it? and finally
How will we find it again?

I then wrap up the discussion by reflecting on what it means to lose some of our digital assets and how we might think of digital archiving technologies from the not-so-lofty perch of our everyday lives.

2. What should we keep? Designating and assessing value

It is easy to accumulate digital belongings. Storage is cheap and getting cheaper. The means of recording (e.g., digital cameras, audio and video recorders, and even chat transcripts) are more available and ubiquitous than ever. Applications that allow people to individually and collaboratively create digital media – and indeed whole digital worlds – are numerous and sophisticated. At the same time, the hardware that makes this accumulation possible is getting smaller. The greatest temptation is to keep almost everything and sort it out later.

At the same time, there is still a need to feel a sense of control over one's digital stuff. Thus long term storage must provide the assurance of control. In certain situations, it is important to explicitly designate value ("I want to keep this forever") or to absolutely and finally get rid of something ("I never want to see a photo of my ex again!"). We also can see from the field data that people may harbor strong opinions about the relative merit of certain digital assets in the aggregate. For example, they may feel their email is not something they care about, or that one email account has all the valuable items and another one contains the dross.

It is simple enough to provide mechanisms to handle the extremes of the value spectrum – to designate the relatively small number of items that are valuable in a noteworthy way and to be able to care for them accordingly – and to delete other items in a way that makes them irretrievable. These mechanisms need only be recognized and available.³

Harder to address are the countless items that are in the middle, the quotidian artifacts of everyday life, the numerous photos of the kids, the correspondence with friends, the downloaded music that has provided a pleasant listening experience, and other of life's souvenirs that we generally find pleasurable to have around but don't absolutely treasure. How can we ensure that an adequate number of these things survive and are findable without overwhelming an individual or making her feel that she's lost control of what she has?

First of all, items like these are by no means stable in their value. The books we loved as children may become meaningless as we grow older; on the other hand, they may be something that we would like to pass on to our children or to browse fondly on a rainy day. Second, the cognitive load of assessing value is enormous; it's why cleaning closets (or disk drives for that matter) is such an unpleasant undertaking.

Thus archiving services and applications must be able to heuristically assess value in a way that makes intuitive sense to individuals over the years. A heuristic assessment of value may be factored into other facilities – such as a mechanism for re-encountering forgotten items – in a way such that individuals are more likely to spend their time handling, viewing, and re-encountering the digital belongings that are the most meaningful to them.

How can relative value be assessed? In the field, I have observed three types of indicators that have bearing on an individual item's value:

source. This indicator has to do with the item's provenance – where did the item come from and how did the individual get it? An item's source speaks to whether it can be replaced or not, at what cost, and how much emotional impact it carries.
actions. This indicator has to do with how a person has viewed, manipulated, or modified the item – what has been done with it after it came into the person's purview? Actions tell us how much labor and creative effort are invested in an item and may be a way of demonstrating its ultimate worth.
disposition. This indicator has to do with how and where the person has stored the item and with whom it has been shared – what did the individual ultimately do with it? The disposition of an item signals whether an individual thinks the item is worth keeping.

Table 1 shows some examples of different types of value indicators. Included in the table are also counterexamples (in red), things that people do that do not demonstrate an item's value.

*Type*	*Value Indicator*	*Example*
Source	Incoming email attachment	Single photo from person on contact list
Source	Downloaded PDF file (https)	Bank statement
Source	File originated locally (.vsd)	Visio schematic
Source	P-t-P MP3 file	Music shared via LimeWire
Action	Manually change metadata	Photo file name is changed from default
Action	Play in a media player	Listen to a song
Action	Create within application	Write a novel
Action	Transactions logged	Browser history accumulated
Disposition	Attach to email and send	Sending an attachment to oneself
Disposition	Write to external media	Write a folder to a USB key
Disposition	Upload to service	Share a photo set on Flickr
Disposition	Remove file	Drag file to trash

Table 1. Three types of value indicators with contrastive examples

It is important to notice three things about this scheme. The first is that context makes it easier to distinguish between items that are valuable and items that have simply accumulated. The fact that the photo is from someone on the individual's contact list (or even from a regular correspondent) makes the photo more apt to be valuable. That a file has been acquired via peer-to-peer sharing makes it less apt to be particularly valuable and more apt to be an opportunistic acquisition. Basing value assessment on this sort of context is similar to the way Implicit Query uses context [Teevan et al., 2005; Cutrell et al., 2006].

The second is that value accretes; something that doesn't seem to be particularly valuable at the outset may have its value demonstrated as time passes. Certainly an item that is the subject of activity over a prolonged period of time – a photo that is cropped, renamed, sent as an attachment, posted to an online album, written to a backup, and shared among devices – has offered us many indications that it is valuable. As a corollary, value also may diminish. The trick is being able to tell the difference between diminished value and benign neglect: the item is still valuable; it just has been forgotten or not handled in a while.

Finally, this scheme relies mainly on intrinsic metadata rather than extrinsic metadata. To maintain an ongoing heuristic assessment of relative value, it will be necessary to automatically capture and store new kinds of metadata instead of relying on an individual's explicit assessment of the item's value. It is unlikely that people will put effort into maintaining medium-value items; witness what happens to the bulk of people's print photographs: they are stored in boxes, kept relatively safe, but not captioned. Intrinsic metadata may be automatically collected based on user activity, device properties, or environmental sensors (e.g., georeferencing).
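As a rough illustration of how such a heuristic might accumulate evidence, the sketch below sums weights over logged source, action, and disposition events for each item. The event names and the numeric weights are hypothetical, chosen only for illustration; nothing in the field data fixes their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    item: str       # item identifier, e.g. a file name
    kind: str       # "source", "action", or "disposition"
    indicator: str  # what was observed about the item


# Hypothetical weights (an assumption of this sketch, not measured values).
WEIGHTS = {
    ("source", "email_attachment_from_contact"): 2.0,
    ("source", "peer_to_peer_download"): 0.0,
    ("action", "renamed"): 1.5,
    ("action", "edited"): 1.5,
    ("disposition", "uploaded_to_service"): 1.0,
    ("disposition", "written_to_external_media"): 1.0,
    ("disposition", "moved_to_trash"): -3.0,
}


def score_items(events):
    """Sum indicator weights per item; unknown indicators count as zero."""
    scores = {}
    for event in events:
        weight = WEIGHTS.get((event.kind, event.indicator), 0.0)
        scores[event.item] = scores.get(event.item, 0.0) + weight
    return scores


events = [
    Event("beach.jpg", "source", "email_attachment_from_contact"),
    Event("beach.jpg", "action", "renamed"),
    Event("beach.jpg", "disposition", "uploaded_to_service"),
    Event("song.mp3", "source", "peer_to_peer_download"),
]
scores = score_items(events)
```

Because the score is a running sum, value accretes as activity continues. The sketch deliberately applies no decay over time, since inactivity alone cannot distinguish diminished value from benign neglect.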

Naturally, intrinsic metadata will introduce new privacy concerns, since we are not accustomed to digital items keeping such close tabs on their own provenance.

3. Where should we put it? Creating a catalog of distributed stores

It has been the assumption of most archiving applications that an individual's digital assets should be centralized in a single trusted repository. Yet this centralization is a very poor match with current practice. As it is, people hedge their bets and scatter their digital belongings among online services according to the functionality offered or the audiences promised, or based on other circumstantial factors (storage limits; one account's password is remembered and another's isn't). Local storage media present consumers with similar reasons for distributing their files – a USB drive is at hand and a CD isn't, for example, or a desktop computer has lots of available storage and a laptop doesn't.

People also exhibit varying degrees of trust for local and network stores: some people feel perfectly comfortable putting their digital belongings on a home server; others feel more secure if their digital assets are stored on a removable drive that is tucked away in a safety deposit box; and still another group is comfortable using network storage options, the "storage in the cloud" solution.

Thus it seems most urgent to create a union catalog of an individual's digital belongings without necessarily completely centralizing the bits themselves, especially at first. In a scheme like this one, as time goes on and circumstances change, items may be moved among stores. For example, suppose an individual has used a social media service to share a personal collection of photos. As time goes on, outside interest in the photos wanes. Finally the service itself is deemed not profitable and is shut down. At this point, based on the catalog data, the photos' owner can decide whether all the photos – or just the higher-value photos – need to be shifted into a different store – or an actual archive – for retention. The catalog may reveal that there are enough other active copies that it's not necessary to do anything. Others subscribing to these photos may also want to know they are disappearing and may want to move their own copies to a safer place.

According to study informants, they may have a variety of reasons for using one service over another, and these reasons shift over time. For example, I asked an art student why she had started to store her animations on YouTube in addition to the other social media sites she was already using. She said:

"[I use] youtube because people always tell me that they don't feel like downloading my quicktime files from archive.org. But youtube isn't for backup though."

When I asked her why she had a Live Spaces site in addition to her other web logs, she told me that it had to do with both audience (other Taiwanese people) and functionality (design constraints):

"Because in Taiwan people always use msn spaces. And then early on msn spaces didn't let you change colors. You can choose from a template only."

It is not uncommon for people to articulate similarly complicated rationales for why they have put some digital belongings in one place, and others in another.

Storage decisions are often based on the functionality, security, and access offered by external sites. For example, Web email accounts are set up so that it is difficult for most users to store and view their email locally; users also enjoy ubiquitous access to their email. Some services – online access to bank accounts, for example – promise safe storage of personal records. Copyright concerns may also cause individuals to rely on external stores like digital libraries.

Naturally, most people also have a number of distributed data stores on computers, devices, and removable media that belong to them or to their friends and families. Files are replicated among these local devices and stores for many reasons too: A person may want to work on a document on a PC and on a laptop; she may back up this week's work on a USB key; she may make a digital mix tape for a friend; and she might burn photos to a CD for safekeeping. Just as with online applications and services, archiving is a side-effect.

Because archiving is a side-effect and not a carefully thought out strategy, people engage in circular reasoning: my photos have been uploaded to a photo sharing site, so there's no reason to back up the local copies; and the originals are on the hard drive, so there's no need to safeguard the copies on the photo sharing site [Marshall et al., 2007].

Given the diverse purposes for replication, it is easy to see that all copies aren't equal. Photos that have been uploaded to a social networking site to be shared may be at a different resolution than photos written to a CD for backup. An individual might copy files back and forth from a laptop to a desktop computer to work on them in both places; these copies may be in different states of completion. It is difficult to generalize which copy is authoritative. In the case of a photo, it may be the original as it was taken off of the camera's memory before any color correction has been undertaken. In the case of a novel, it may be the latest version of the file.

These two properties of distributed storage – different copies are put in different places for different reasons, and not all of these copies are equal – suggest two important requirements on a global catalog. First, it needs to keep track of all of the various copies of any specific item. As long as there are multiple copies being used in different ways, it is only necessary to maintain the reference to the item and the item's metadata. As copies disappear for one reason or another, it is important to know when an item of significant value is in danger. People in our study lost digital belongings not through technological catastrophe, but rather through minor negligence [Marshall et al., 2007]. An account expires and email notification fails, for example. Individuals should be able to find out where an item is and how many copies they have of it.
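The first requirement can be pictured with a small sketch: a catalog that stores references to copies rather than the bits themselves, answers "where is it and how many copies do I have?", and flags valuable items whose copy count has fallen. The class names, the store labels, and the thresholds `min_copies` and `min_value` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    store: str      # where the copy lives, e.g. "laptop" or "photo-site"
    note: str = ""  # free-form remark, e.g. "reduced resolution"


@dataclass
class CatalogEntry:
    item_id: str
    value: float                         # heuristic value, however obtained
    copies: list = field(default_factory=list)


class Catalog:
    """Union catalog: tracks references to copies, not the bits themselves."""

    def __init__(self):
        self.entries = {}

    def add_copy(self, item_id, store, value=0.0, note=""):
        # The value is only recorded the first time an item is seen.
        entry = self.entries.setdefault(item_id, CatalogEntry(item_id, value))
        entry.copies.append(Copy(store, note))

    def remove_store(self, store):
        """Drop every copy held in a store, e.g. when a service shuts down."""
        for entry in self.entries.values():
            entry.copies = [c for c in entry.copies if c.store != store]

    def where(self, item_id):
        return [c.store for c in self.entries[item_id].copies]

    def at_risk(self, min_copies=2, min_value=1.0):
        """Valuable items that have fewer than min_copies copies left."""
        return sorted(
            e.item_id
            for e in self.entries.values()
            if e.value >= min_value and len(e.copies) < min_copies
        )


catalog = Catalog()
catalog.add_copy("wedding.jpg", "laptop", value=5.0)
catalog.add_copy("wedding.jpg", "photo-site", note="reduced resolution")
catalog.add_copy("receipt.pdf", "laptop", value=0.2)
catalog.remove_store("photo-site")  # the sharing service closes
```

After the photo site disappears, `catalog.at_risk()` reports the wedding photo but not the low-value receipt, which is the kind of early warning that might have helped against the minor negligence described above.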

Second, it is vital to record the provenance of each item. It may not be the case that the last copy of a photo is going to disappear; instead it may be that the original, highest resolution copy is about to be deleted. Where did this copy come from? What has been done to it? Consider a photo that has been taken off a camera's memory card and stored in local storage in the camera's RAW format. The photographer might rotate the image 90 degrees initially to orient it correctly. Then he or she might do some simple image enhancement. Later, a sophisticated image manipulation program like Photoshop might come into play; while the original image is not overwritten, another version in a different format is created as a result of this image manipulation. Finally, when the image is attached to an email message, its resolution might be radically reduced. Later, when the photographer is asked what to do with a copy of the image, it is crucial that he or she knows that there are other copies, where they came from, and what the differences among them are.
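A minimal sketch of the photo scenario, assuming each version records the version it was derived from and the operation that produced it. The field names, the use of pixel width as a stand-in for resolution, and the warning texts are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    version_id: str
    derived_from: Optional[str]  # None marks an original
    operation: str               # what produced this version
    width: int                   # pixel width, a stand-in for resolution


def lineage(versions, version_id):
    """Walk derived_from links back to the original, nearest ancestor first.

    Assumes the derivation links contain no cycles.
    """
    by_id = {v.version_id: v for v in versions}
    chain = []
    current = by_id[version_id]
    while current.derived_from is not None:
        current = by_id[current.derived_from]
        chain.append(current.version_id)
    return chain


def deletion_warnings(versions, version_id):
    """Reasons the owner may want to think twice before deleting a version."""
    by_id = {v.version_id: v for v in versions}
    target = by_id[version_id]
    warnings = []
    if target.derived_from is None:
        warnings.append("this is the original")
    others = [v for v in versions if v.version_id != version_id]
    if all(target.width > v.width for v in others):
        warnings.append("this is the only copy at the highest resolution")
    return warnings


versions = [
    Version("original.raw", None, "imported from camera", 4000),
    Version("edited.psd", "original.raw", "cropped and retouched", 3000),
    Version("emailed.jpg", "edited.psd", "downsized for email", 800),
]
```

Here `lineage(versions, "emailed.jpg")` leads back through the edited file to the RAW original, and asking about deleting `original.raw` produces both warnings, while deleting the emailed copy produces none.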

Today, provenance is inferred from the filename and its variants, from other file metadata (such as file date), or from visual inspection of a rendering. For example, the following quote is from an interview with a Computer Science researcher:

"So what I do if I edit [a photo], I don't want to destroy the original, ... so I'll keep the same name and I'll put an 'e' on it, to show that it's edited. ... So look at this one. I edited it to make it black and white. And so I put bw at the end of [the filename]. So it's the same version as the one [without the bw]. In this case, I would've expected that I would have one by that name, but I don't. I only have the black and white. So I don't know what I did in this case."

Naturally, this practice is dangerous. Using a thumbnail rendition, a consumer will not always notice the difference between two very different resolutions of a photo. When files are moved from file system to file system, sometimes the dates are modified inadvertently and no longer reflect the file's actual age. As we see above, expectations may be violated in these schemes, leaving the consumer to wonder whether he or she has the real original. Finally, implicit encodings may become confusing over time; file names can only encode so much information about the file's provenance.

Much of the information about relationships among files is recorded implicitly and reconstructed by inference. Which Word document did a PDF file come from? The author will examine file dates to find out or will assume one file is derived from another because the name is the same although the file type has changed.

Current practice among computer scientists is to bundle related files together as a compressed whole at storage time. Zipping files together is not done with an eye toward recovering the small amount of wasted space; rather, it is done because otherwise the relationships are easily lost [Marshall, 2008]. A Fedora-like representation may be useful for expressing the relationships among digital objects [Lagoze et al., 2005]. Note that if this bundling is done at creation time, these relationships do not need to be inferred nor specified explicitly. Rather, they can simply be recorded at the outset.
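The idea of recording relationships at the outset, rather than inferring them later from names and dates, can be sketched with a zip bundle that carries a small manifest. The manifest layout and the relation name `derived_from` are assumptions of this sketch; it is not Fedora's object model, only an illustration using Python's standard `zipfile` and `json` modules.

```python
import io
import json
import zipfile


def make_bundle(files, relationships):
    """Zip related files together with a manifest of their relationships.

    files: mapping of file name -> bytes
    relationships: list of (subject, relation, object) name triples
    """
    manifest = {
        "files": sorted(files),
        "relationships": [
            {"subject": s, "relation": r, "object": o}
            for s, r, o in relationships
        ],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


def read_relationships(bundle_bytes):
    """Recover the recorded relationships without guessing from file names."""
    with zipfile.ZipFile(io.BytesIO(bundle_bytes)) as archive:
        manifest = json.loads(archive.read("manifest.json"))
    return [
        (r["subject"], r["relation"], r["object"])
        for r in manifest["relationships"]
    ]


bundle = make_bundle(
    {"paper.doc": b"source text", "paper.pdf": b"rendered text"},
    [("paper.pdf", "derived_from", "paper.doc")],
)
recovered = read_relationships(bundle)
```

With the manifest in place, the question of which Word document a PDF came from is answered by reading a record rather than by comparing file dates.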

It is clear that we need to develop better mechanisms for recording an item's provenance, maintaining it over time, and presenting it when the object is recovered from long-term storage. The systems community has made inroads in this area that can provide the basis for such mechanisms [Muniswamy-Reddy et al., 2006].

4. How should we maintain it? Curation services and mechanisms

We often assume that digital stewardship is simply a matter of storing the data once and recovering it when we would like to see it thirty years hence. In fact, there are very few fully formed cost models for digital archiving that spell out all of the aspects of maintaining a digital collection, particularly at the level of personal artifacts; the best has been developed by Mary Baker and her colleagues [Baker et al., 2006]. Most cost models focus on the initial storage of digital objects, although a few try to take the effects of preservation strategies such as emulation into account [Reichherzer and Brown, 2006] or to provide an operational estimate of storage costs [Moore et al., 2007].

Setting aside costs for the moment, what – at minimum – does maintaining a digital store entail? It seems productive to divide curation into three different kinds of activities: (1) invisible, routinized activities that involve every item and that can potentially be automated; (2) communal activities that take advantage of a group's, a community's, or an institution's investment in keeping material organized, labeled, and culled; and (3) individual activities that we can neither automate nor distribute among community members (i.e., everything else). I discuss each briefly, acknowledging that all of these are worthy of a significant amount of attention in their own right.

Invisible, automated, per-file maintenance. What kinds of invisible curation activities must we take into account for a viable personal archiving service? These are per-object activities that may be automated, activities that may be the target of specific services and functionality. Some of them are basic IT functions that address observed problems in consumer computing environments, such as detecting and removing virus infections; others are specific preservation activities identified by researchers and practitioners of digital archiving. Functionality includes:

Virus and malware checks at deposit;
Regular refreshing of storage media;
Any initial canonicalization of file formats [Lynch, 1999]; and
Periodic migration of uncanonicalized files to keep file formats up to date.

Note that these kinds of regular, basic, per-file curation activities are best performed with minimal human intervention; our past observations tell us that IT is often performed in a very ad hoc way with whatever help is locally available. A for-pay service may be the most realistic way of implementing the automatic performance of these regular IT functions that must touch every file and that may change dramatically over time.

Virus and malware checks are an important part of an automated service; even scholarly archives are so afflicted [Adams, 2006]. It is not unusual for consumers' computers to become infected by the latest viruses and malware; often these infections are undetected until their effect is truly catastrophic. At that point, individuals more or less throw up their hands, as with this recent study participant:

"The conundrum that I'm in is like in order to back anything up on this computer, the computer has to be working well, and in order to get the computer working well, I should have backed up everything on this computer. D'ya know what I'm saying?"

Conservatively estimated, 68% of home Internet users had experienced some problem caused by malware in the year ending July, 2005 [Fox, 2005]. Furthermore, the trend is that viruses and malware are on the increase, so virus-checking is an uncontroversial component of regular automated curation.

Similarly, refreshing storage media is also an uncontroversial part of a regular curation regimen. Unless there is a drastic change in storage technology, given a predictable mean time to data loss, files will need to be moved to new storage media regularly to ensure reliability [Gray and van Ingen, 2005].
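As one way to picture a refresh step, the sketch below copies every file from an old location to a new one and verifies each copy by comparing SHA-256 digests. Checksum verification is an assumption added here, not something the cited work prescribes, and the directory layout is invented for the demonstration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source_dir, target_dir):
    """Copy every file to new media; return files whose copies fail to verify."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    failed = []
    for source in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
        if sha256_of(source) != sha256_of(target):
            failed.append(str(relative))
    return failed


# Demonstration with temporary directories standing in for old and new media.
with tempfile.TemporaryDirectory() as old, tempfile.TemporaryDirectory() as new:
    (Path(old) / "photos").mkdir()
    (Path(old) / "photos" / "a.jpg").write_bytes(b"example bytes")
    failures = refresh(old, new)
    copied = (Path(new) / "photos" / "a.jpg").read_bytes()
```

Using `shutil.copy2` rather than a plain copy is a small hedge against file dates being changed inadvertently when files move between file systems.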

One of the trickiest aspects of digital stewardship is maintaining the files in a form that is appropriate for anticipated future use [Levy, 1998]; to be cost-effective, curation should preserve the salient properties that are necessary for future use, without expending the extra effort needed for full emulation, as for example would be required by the scheme suggested by Rothenberg [Rothenberg, 1995]. It may help to anticipate these four contrasting cases:

The file will be used primarily by software (that is, further processing will be performed on it, as with raw financial data);
The file will be rendered and viewed by a person;
The file will be used in an interactive manner (and if so, whether it will involve other related files, as it would with a web site); or
The file will be modified as a result of use (as it would if a creative process were resumed).

Table 2 summarizes some potential strategies for moving individual digital objects forward through time.

	process	view	interact	change
*salient properties*	file may be read and processed by software	file may be rendered for a human viewer	file may be interacted with as created	file may be edited as a continuation of the creative process
*example*	financial data used by analysis software	email and other personal documents; personal records	personal web site as HTML files + scripts	creative work such as a Photoshop file
*curation strategy*	self-describing XML formats that capture context	canonicalization; appropriate format choices (e.g. PDF/A)	may require emulation to capture full interactive functionality	migration; may require emulation if appropriate application no longer exists

Table 2. Strategies for moving digital objects forward through time

Because canonicalization or migration may be lossy and storage curves suggest that we may be able to keep the original files as a hedge against inadvertent loss, it will probably be most prudent to package the original digital object along with any canonicalized or migrated versions. It is always possible that we will get it wrong during any initial format normalization or during any one of a number of migrations as the years go forward. Retaining the original digital object also leaves the door open for future use of emulation or resuscitating the original platform to recover items of sufficiently high value (for example, interactive artworks). Retaining the original digital object may also be necessary from the standpoint of provenance.
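Read together, Table 2 and the advice to keep the original suggest a simple lookup from anticipated use to curation steps. The sketch below is one hypothetical encoding; the step wording paraphrases the table, and the rule that every plan begins by retaining the original is an interpretation of the paragraph above.

```python
from enum import Enum


class Use(Enum):
    PROCESS = "process"
    VIEW = "view"
    INTERACT = "interact"
    CHANGE = "change"


# Paraphrase of the curation strategies in Table 2 (wording is illustrative).
STRATEGY = {
    Use.PROCESS: ["store as self-describing XML"],
    Use.VIEW: ["canonicalize to an archival format"],
    Use.INTERACT: ["emulate the original environment"],
    Use.CHANGE: [
        "migrate to a current format",
        "emulate if the application no longer exists",
    ],
}


def plan(anticipated_uses):
    """Curation steps for a set of anticipated uses, in table order, no repeats."""
    steps = ["retain the original object"]
    for use in Use:
        if use in anticipated_uses:
            for step in STRATEGY[use]:
                if step not in steps:
                    steps.append(step)
    return steps


example_plan = plan({Use.VIEW, Use.CHANGE})
```

An object expected to be both viewed and edited later would thus be kept in its original form, canonicalized for viewing, and migrated (or emulated as a fallback) for continued editing.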

Emulation is frequently cited as the most true-to-intention method of recovering old files; however, emulation is difficult and costly, mostly due to unforeseen changes in document elements such as fonts and codecs [Reichherzer and Brown, 2006] and other aspects of the platform that are external to the application. That said, emulation is a good example of an on-demand preservation service; there are cases in which emulation will provide the highest fidelity version of a particular digital artifact. A service – and possibly access to human support – is likely to make more sense than the tremendous up-front investment such as that implicit in Lorie's Universal Virtual Computer representation of self-rendering objects [Lorie, 2002].

Communal maintenance. As social software and digital libraries mature, we are beginning to realize some economies of scale; neither individuals nor institutions are obligated to go it alone to maintain their digital belongings.

This has always been true to a certain extent with our physical belongings. Often one member of a family is the de facto historian, maintaining the photo album, labeling individual pictures, recording and organizing the home movies, or keeping the box of 'treasures' safe and accessible for the rest of the family. As software like Flickr has so aptly demonstrated, there is no reason this shouldn't continue to be the case in the digital world.

There are, however, some challenges associated with communal maintenance of digital belongings. Digital curation requires unrelated types of expertise: one often has to be an expert in technology matters as well as family matters. Unfortunately, these types of expertise may not be embodied in the same person; for communal curation to be effective, it must be designed for the individual with expertise in the subject matter. It should also be easy for trusted family members to add annotations and metadata, while the curator retains ultimate authority.

On a larger scale, maintaining archives of consumer assets calls for a partnership among libraries, publishers, non-profits, and software and Internet services companies to develop a sense of what cultural stewardship means. Workable copyright policies need to be developed so that people can either maintain their own collections or refer to copies of published material outside of their personal stores that may be re-retrieved upon demand. Constraints introduced by patents and proprietary formats will also need to be addressed before we have any hope of bringing digital materials forward in time; the Library of Congress format registry is a start on this [Arms and Fleischhauer, 2005]. Trust and security interests need to be balanced. Finally, at the heart of communal collection maintenance, a financially sustainable enterprise must be created to form its backbone.

Individual maintenance. At some point, individuals simply must be involved with the maintenance of their own stuff. At the heart of any personal archiving endeavor lies the individual who has the judgment to say what's important and what's not and sufficient desire to keep his or her assets that he or she is willing to perform minimal curatorial duties.

Security is one of the most troubling aspects of maintenance. How much security is enough? How much is too much? We have already seen that one way people lose digital materials is that they simply lose access to them: they lose accounts and passwords. They don't remember to log in to an account to keep it current, or they don't remember how to log in to an account. In the future, it is reasonable to expect encryption keys to be lost and other security techniques to be applied inappropriately or with way too much zeal.

Thus an important part of digital stewardship is maintaining account access information – where collections are stored and how to get into them. This access information creates a point of vulnerability, but it is a vulnerability that already exists: our studies have shown that people maintain this information in ways such as emailing it to themselves or keeping it centralized on a password-protected, but easily accessible, account. For example, one study participant who was fairly sophisticated in digital matters told us:

"If I have a password, an impossible-to-remember password, instead of writing it on a post-it note, and posting it on my desk, I'll literally send an email to myself with that password. And then if I ever need it, I'll just go to gmail and just type in 'special password', and it'll pop right up."

Another, a computer science researcher, said:

"I store all sorts of stuff in here [his fastmail account] having to do with my personal life and my personal – you know, things I've bought, passwords for accounts."

More than five years ago, Microsoft attempted to solve this problem with Passport, but that product was not well-received by the market. Yet if such a capability were folded into a personal digital archive, it would address a very real problem in a way consistent with current practice.

Finally, it should come as no surprise that curatorial tools should take advantage of distinctions in genre. Photos should be tended as photos, records as records, and movies as movies. Curatorial tools and standard digital formats should vary with genre and with the individual's commitment to maintaining the collection. This variability in curatorial requirements supports a model of distributed storage and access – it is likely that different storage venues will provide different tools for managing the material stored there.

5. How will we find it again? New access modes

It is absolutely clear that ordinary desktop search and file browsing will be insufficient to support long term access to personal materials. New modes of access will need to be developed to tackle the problems of accumulated personal digital assets.⁴ Some of the most conspicuous of these problems stem from the fact that it is so easy to forget what we have, let alone remember where we put it. Not only is it normal to forget individual items such as specific blog posts or email messages; it is also not unusual to forget about entire collections [Marshall, 2007]. Other difficulties arise from the standard practice of replicating items, both for backup and for specific purposes (for example, to share them); as we have discussed earlier in this article, as time passes, it is difficult to know where the highest resolution version of a photo is or whether the photo has been somehow modified.

Specific capabilities that we have anticipated as necessary – and that have been presaged earlier in this discussion – include:

Circumscribed, stable digital places
Digital geographies
Venues for re-encounter
Visualization tools
Tools for finding and choosing among duplicates
Application-independent viewers

Circumscribed, stable digital places. One thing the personal computer and subsequent services have failed to offer is a stable sense of digital place. What I mean is the digital equivalent to the box under the bed or the footlocker in the guest-room closet or the safety deposit box at the bank or even (at the extreme) the bomb shelter in the backyard – a place where valuables are kept. You have no need to remember what's in the box, just that it's where the valuables are.

Digital geographies. Social software such as Second Life⁵ creates 3D digital worlds, but they are not places where one can store one's digital belongings (other than belongings created within-world). Yet there is something compelling about knowing where all of your collections are with respect to each other. This needn't be an actual geography, but rather a set of digital places that form a virtual geography, a mode of remembering where things are. Digital geographies may follow in a long tradition of mind maps and their ilk [Churchill and Ubois, 2008]. They should permit a sense of differentiated places and permanent landmarks [Ringel et al., 2003].

Venues for re-encounter. Creating venues for re-encounter of lost or forgotten material is part and parcel of supporting access to personal digital archives. Because it is not unusual for people to forget what they have (even if it is valuable) or misremember its salient characteristics, even the best desktop search engine will not meet the requirements of long term retrieval. Essentially what is needed is a collection of methods for reminding oneself of both the high-value things that one has – the things people declare that they want to 'save forever' – and the medium-value items that represent the extent of one's digital belongings. The MyLifeBits project has used a screen saver to support photo re-encountering [Gemmell et al., 2006]. Implicit Query may also provide a mechanism for suggesting forgotten items [Cutrell et al., 2006]. But re-encounter is by no means straightforward: if an individual is doing intellectual work, he or she may not want to be interrupted by sentimental material, no matter how evocative it is (or perhaps, especially if it's evocative). Similarly, not all fodder for re-encounter is appropriate in all circumstances; many people have photos and email that they would consider embarrassing.⁶ Users would need to control the circumstances under which such material re-appeared.
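One way to picture the user control described above is a small filter placed in front of whatever mechanism chooses items to resurface. The sketch below is only illustrative: the `Item` fields, the `sensitive` flag, the "work" and "leisure" context labels, and the `pick_for_reencounter` function are all assumptions made up for this example, and none of them describe how MyLifeBits or Implicit Query actually work.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    value: str        # hypothetical labels: "high" or "medium"
    sensitive: bool   # user-set flag for material they would find embarrassing


def eligible(items, context, allow_sensitive=False):
    """Return the items that may resurface in the given context."""
    if context == "work":
        # Assumed policy: never interrupt intellectual work with re-encounters.
        return []
    return [i for i in items if allow_sensitive or not i.sensitive]


def pick_for_reencounter(items, context, rng=None, allow_sensitive=False):
    """Pick one eligible item at random, or None if nothing is eligible."""
    pool = eligible(items, context, allow_sensitive)
    if not pool:
        return None
    rng = rng or random.Random()
    return rng.choice(pool)
```

The point of the design is that the policy (when, and what, may reappear) is kept separate from the selection itself, so the person who owns the material can change the former without touching the latter.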

Visualization tools. There have already been many efforts at creating visualizations to support browsing of extensive collections of personal material [Perer et al., 2006]. Many of the most effective ones use intrinsic properties of the material such as chronology or geographical location [Graham et al., 2002]. There's considerable room for creativity in creating visualizations. Browsing methods will generally take advantage of genre distinctions, but there is also a need for methods that create time-based visualizations that cut across different media types [Ringel et al., 2003].
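As a rough illustration of a time-based view that cuts across media types, the sketch below groups items of any genre by the month in which they were created and renders the result as plain text. The record layout (name, genre, date) and the function names `timeline` and `render` are assumptions chosen for this example; a real tool would draw on whatever dates the items' own metadata provide and would present them graphically.

```python
from collections import defaultdict
from datetime import date


def timeline(items):
    """Group (name, genre, date) records by year-month, regardless of genre."""
    buckets = defaultdict(list)
    for name, genre, when in items:
        buckets[when.strftime("%Y-%m")].append((genre, name))
    # Chronological order; within a month, sort by genre and then by name.
    return [(month, sorted(entries)) for month, entries in sorted(buckets.items())]


def render(items):
    """Render the timeline as plain text, one month per line."""
    lines = []
    for month, entries in timeline(items):
        lines.append(month + ": " + ", ".join(f"{n} [{g}]" for g, n in entries))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render([
        ("trip.jpg", "photo", date(2007, 6, 3)),
        ("itinerary.eml", "email", date(2007, 6, 1)),
        ("story.doc", "document", date(1998, 2, 9)),
    ]))
```

Because the grouping key is time alone, a photo and an email from the same month land side by side, which is the cross-genre effect the paragraph above describes.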

Tools for finding and choosing among duplicates. One of the key advantages of digital artifacts is that they can be copied and changed so easily. Many archiving strategies use digital copies as the methodological linchpin to safe storage [Maniatis et al., 2005]. People make copies for other reasons: to share stuff, to use it in different ways, to deliberately include it in multiple collections, to keep an original when modifications are made, and to protect oneself against what computer scientists refer to as 'fat-fingering', accidental deletion of wanted material. Thus, what we mean when we refer to copies may not be actual copies – bit-by-bit duplicates – but rather copies at different fidelities (less resolution, in the case of photos, for example). It is important to be able to get from one copy to the others and to identify which copy is the ground truth, the photograph as taken, the original video footage, or another form of reference copy.
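The simplest part of this problem, finding bit-by-bit duplicates, can be sketched with content hashing. The example below groups files by their SHA-256 digest; it will not detect copies at different fidelities (a downsized photo hashes differently from its original), which would need some form of similarity matching not shown here. The `pick_reference` function is a purely hypothetical heuristic that treats the copy with the oldest modification time as the reference copy; modification times are easily changed by copying, so this is an assumption for illustration rather than a reliable rule.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_exact_duplicates(paths):
    """Map digest -> sorted paths, keeping only digests shared by 2+ files."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(p)
    return {d: sorted(ps) for d, ps in groups.items() if len(ps) > 1}


def pick_reference(paths):
    """Hypothetical heuristic: the oldest modification time marks the reference copy."""
    return min(paths, key=lambda p: (os.path.getmtime(p), p))
```

Given a list of paths gathered from several storage venues, `group_exact_duplicates` answers "which of these are the same item?", and `pick_reference` offers one guess at "which one should I treat as the original?" that a person could then confirm or override.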

Application-independent viewers. There has been considerable discussion over the years of emulation, of reconstructing an entire computing platform in order to completely regain the original capabilities of the editor used to create a document. Much of the time, this will not be necessary. What will be necessary are viewers that display material independent of the original application and its functionality, and that allow users not only to examine content, but also to inspect the item's provenance and understand why it was considered valuable. The ability to display meaningful reduced representations of items – for example, thumbnails of visual material – can help a user go through material more quickly. If you want to convince yourself of this, try going through Flickr's photos in the multiple available representations; it's easy to see how the ability to fit a large number of photos onto a single page can help with a visual search task.

6. Conclusion

It is easy to fall into various traps when we talk about personal digital archiving: that it's not a problem, that it's not my problem, that it's a problem of data and media formats, that we should just keep everything and worry about it later, that we should hop to it and build a comprehensive library of emulators or encode every digital object so it knows how to render itself, or that it's simply intractable and pointless to even think about personal digital archiving, and we should let the bits fall where they may.

In some sense, these aren't exactly traps. It's necessary to solve some subparts of the larger whole, to standardize formats and support format registries (for example, see [Abrams, 2005]), to be able to emulate disappearing platforms, to understand cost factors, and to measure the lifespan of digital media [Youket and Olson, 2007]. But it's also necessary to acknowledge that these solutions don't constitute the whole problem.

That people lose files in predictable ways – apart from formats becoming obsolete or catastrophes striking storage devices and media – suggests that we begin to investigate other approaches to digital archiving. It's surprising that so many people lose important digital material through everyday benign neglect: accounts are deleted because they haven't been accessed recently; whole services disappear (sometimes without warning) because of a small company's unsustainable business model; and removable storage media are simply misplaced. These are ordinary situations and shouldn't be catastrophic.

For example, consider this podcaster's plight. She stored an extensive series of podcasts – material she'd put considerable effort into creating – on a service that subsequently went bust⁷:

"i hosted my podcasts early on on a free service called Rizzn.net (the owner's personal site is rizzn.com, and his podcast hosting was rizzn.net). he then changed rizzn.net to something called blipmedia.com... and then!! he decided to sell blipmedia (or something.. it's on wikipedia) and he never emailed people about it.. suddenly the files were gone and the only news i heard about it was when i had to hunt online for what happened... and in blipmedia's google help group it was only when people ASKED HIM ABOUT IT that he explained... i thought i burned backups to a cd but now i can't find them so my audio files from Jan. 2005 to about mid 2006 were gone... so lame."

So lame indeed. Usually there's more notification than this user received, but often the warning of such a service interruption goes to an out-of-date email address or an email account that is seldom accessed and not necessarily carefully monitored or read.

On one hand, it is normal and perhaps even necessary to lose a certain amount of one's digital stuff to the forces of benign neglect. If we were able to keep everything (as many computer scientists propose), would we ever want to go through this unmanageable accumulation, even if it were filtered sensibly? In the physical world, benign neglect slowly but surely prunes our belongings. And it's almost painless. We move a few times, lose a few boxes, feel the pangs of regret, but in the end, we're left with a lighter load.

Don't we want to lose some of the heavy burden of our own history?

On the other hand, we don't want to end up with great unfillable gaps in our personal record. Digital loss has a tendency to be an all-or-nothing proposition. People don't lose just a few of the baby pictures of their first child; they lose ALL of them.

What we want then is a combination of services and mechanisms that will make it possible to designate which of our digital things are the most valuable; to organize the rest of them into tractable archives that reflect the items' value; and to not spend all kinds of extra time taking care of them. It's best not to become a slave to one's own stuff.

Personal archiving technology should fit organically into everyday practice: it should take advantage of the fact that increasingly we're storing stuff online, on social media sites, in blogging tools, on web sites, in online banking systems, in medical records repositories, and so on. We won't stop doing this because we have an archiving system at our disposal, nor should we. These other places have capabilities and audiences that are not replaceable by a single, centralized digital store.

While it is seductive to envision a single venue – storage in the cloud – to be the repository of everything we care about, it is more realistic to acknowledge that once people have made a few copies of a treasured item, they feel reasonably secure about its fate. For example, when participants describe storing photos on Flickr, they no longer see a reason to create additional backups of the files:

"The good thing about the photos is that there's always an intermediary step. I mean, like the photos go off of my camera onto my computer before they go up to Flickr. So I always have master copies on my PC. So that's why I don't care so much about Flickr evaporating."

Why – once the photos are shared and tucked away locally – would an individual take the time and bandwidth (or possibly bear the expense) to put files in yet another place?

The answer is straightforward: it is more important to know what we have and where we've put it than it is to centralize all of our stuff into a single repository.
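To make "knowing what we have and where we've put it" concrete, here is a minimal sketch of a metadata-only catalog that records where copies of an item live without holding the items themselves. The class and method names (`Catalog`, `record`, `where_is`, `single_copy_items`) and the example locations are assumptions invented for illustration; they do not describe any existing service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    title: str
    genre: str                                      # e.g. "photo", "record", "movie"
    locations: list = field(default_factory=list)   # where copies live


class Catalog:
    """A metadata-only index: it records where items are, not the items themselves."""

    def __init__(self):
        self._entries = {}

    def record(self, title, genre, location):
        """Note that a copy of the titled item is stored at the given location."""
        entry = self._entries.setdefault(title, CatalogEntry(title, genre))
        if location not in entry.locations:
            entry.locations.append(location)

    def where_is(self, title):
        """Return the known locations of an item, or an empty list if unknown."""
        entry = self._entries.get(title)
        return list(entry.locations) if entry else []

    def single_copy_items(self):
        """Titles stored in only one place, and so most exposed to loss."""
        return sorted(t for t, e in self._entries.items() if len(e.locations) == 1)
```

A catalog like this leaves photos on the photo-sharing site and records in the online banking system, yet it can still answer the two questions that matter most over time: where is this item, and which items exist in only one place?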

Many reflective thinkers of our time have warned of a coming digital dark ages [Kuny, 1998]. This may seem too dramatic to describe what actually seems to be happening. Instead, we need to be mindful of the quotidian pleasures of coming upon a bundle of old love letters (which these days might correspond to some Facebook flirtation), a dog-eared photo of college friends at the beach, a short story written a decade ago and tucked away.

We want to be sure such pleasures remain possible and within our reach.

Acknowledgments

I'd like to thank Catharine van Ingen for many helpful discussions about storage, media, and archiving and Sara Bly and Francoise Brun-Cottan for invaluable fieldwork assistance and data analysis help. Michael Nelson and Frank McCown (and their terrific Warrick "Lazy Preservation" application) were the main impetus for finding out about lost websites. Will Manis and Jeff Ubois have both been great sounding boards for these ideas as they've taken shape. Thanks too to Doug Terry and the CIM project.

Notes

1. This is not unlike the trend in institutional repositories: instead of believing that scholars will deposit their publications and datasets in multiple repositories, or in one central repository, we have come to realize that it is more effective to federate at the metadata level.

2. This is an assumption that is technology driven: that because we can keep everything (see [Santry et al., 1999]), we should keep everything. Under this regime, even deletion is seen as simply "setting the deleted bit" to make the item invisible.

3. They may, in fact, correspond to a repository of the sort that dominates the archiving literature.

4. Tori Orr has written a fascinating literature review that covers how biographies and social histories might be indexed to facilitate retrieval (see Orr, 2004).

5. <http://secondlife.com/>.

6. From home visits, I know that most people have naughty photos on their computers, even if the person doesn't seem to be "the type".

7. The quote is taken from a Skype IM transcript. I have eliminated time stamps and breaks for readability. The double periods are literally in the transcript; the triple periods represent elided text.

References

[Abrams, 2005] S. Abrams, "Establishing a Global Digital Format Registry." Library Trends 54.1 (2005) 125-143. <http://muse.jhu.edu/demo/library_trends/v054/54.1abrams.html>.

[Adams, 2006] G. Adams, "Beyond OAIS." Proceedings of Archiving 2006. (Ottawa, Canada, May 23-26, 2006), Springfield, VA: Society for Imaging Science and Technology, p. 7. <http://www.imaging.org/store/epub.cfm?abstrid=33617>.

[Arms and Fleischhauer, 2005] C. Arms and C. Fleischhauer, "Digital Formats: Factors for Sustainability, Functionality, and Quality." Proceedings of IS&T Archiving 2005 (Washington, DC, May 23-26), Society for Imaging Science and Technology, Springfield, VA, 2005. <http://memory.loc.gov/ammem/techdocs/digform/Formats_IST05_paper.pdf>.

[Baker et al., 2006] M. Baker, M. Shah, D.S. Rosenthal, M. Roussopoulos, P. Maniatis, T. Giuli, and P. Bungale, "A fresh look at the reliability of long-term digital storage," In Proceedings of Eurosys 2006. pp. 221-234. <http://doi.acm.org/10.1145/1217935.1217957>.

[Churchill and Ubois, 2008] E. Churchill and J. Ubois, "Designing for Digital Archives," Interactions, 15, 2 (March/April, 2008). <http://doi.acm.org/10.1145/1340961.1340964>.

[Cutrell et al., 2006] E. Cutrell, S. Dumais, and J. Teevan, "Searching to Eliminate Personal Information Management." Communications of the ACM, 49 (1): 58-64 (Jan. 2006). <http://doi.acm.org/10.1145/1107458.1107492>.

[Fox, 2005] S. Fox, Spyware: The threat of unwanted software programs is changing the way people use the internet. Report published by the Pew Internet & American Life Project, 6 July 2005. <http://www.pewinternet.org/PPF/r/160/report_display.asp>, accessed 12 December, 2007. <http://www.pewinternet.org/pdfs/PIP_Spyware_Report_July_05.pdf>.

[Gemmell et al., 2006] J. Gemmell, G. Bell, and R. Lueder, "MyLifeBits: a personal database for everything." Communications of the ACM, 49 (1): 88-95. <http://doi.acm.org/10.1145/1107458.1107460>.

[Graham et al., 2002] A. Graham, H. Garcia-Molina, A. Paepcke, and T. Winograd, "Time as Essence for Photo Browsing Through Personal Digital Libraries," Proceedings of JCDL 2002 (Portland, OR, July 14-18 2002), pp. 326-335. <http://doi.acm.org/10.1145/544220.544301>.

[Gray and van Ingen, 2005] J. Gray and C. van Ingen, "Empirical Measurements of Disk Failure Rates and Error Rates." Microsoft Technical Report MSR-TR-2005-166, Microsoft Research, December, 2005. <http://research.microsoft.com/research/pubs/view.aspx?msr_tr_id=MSR-TR-2005-166>.

[Kuny, 1998] T. Kuny, "A Digital Dark Ages? Challenges in the Preservation of Electronic Information," International Preservation News, 17 (May 1998). <http://www.ifla.org/IV/ifla63/63kuny1.pdf>.

[Lagoze et al., 2005] C. Lagoze, S. Payette, E. Shin, and C. Wilper, "Fedora: An Architecture for Complex Objects and their Relationships." International Journal of Digital Libraries: Special Issue on Complex Objects, Springer 2005. Available online at <http://arxiv.org/abs/cs.DL/0501012>.

[Levy, 1998] D.M. Levy, "Heroic measures: reflections on the possibility and purpose of digital preservation." Proceedings of DL'98, pp. 152-161 (1998). <http://doi.acm.org/10.1145/276675.276692>.

[Lorie, 2002] R. Lorie, "A Methodology and System for Preserving Digital Data." In Proceedings of JCDL'02 (Portland, Oregon, July 14-18, 2002). ACM Press, New York, NY, 2002, pp. 312-319. <http://doi.acm.org/10.1145/544220.544296>.

[Lynch, 1999] C. Lynch, "Canonicalization: A fundamental tool to facilitate preservation and management of digital information." D-Lib Magazine, 5, 9 (September 1999). <doi:10.1045/september99-lynch>.

[Maniatis et al., 2005] P. Maniatis, M. Roussopoulos, T. Giuli, D. Rosenthal, M. Baker, and Y. Muliadi, "LOCKSS: A peer-to-peer digital preservation system." ACM Transactions on Computer Systems, 23(1): 2-50, Feb, 2005. <http://doi.acm.org/10.1145/1047915.1047917>.

[Marshall, 2008] C.C. Marshall, "From Writing and Analysis to the Repository: Taking the Scholars' Perspective on Scholarly Archiving." To appear in Proceedings of JCDL 2008 (Pittsburgh, PA, June 16-20, 2008), New York: ACM Press.

[Marshall, 2007] C.C. Marshall, "How People Manage Personal Information over a Lifetime." In Personal Information Management (Jones and Teevan, eds.), University of Washington Press, Seattle, Washington, 2007, pp. 57-75. <http://www.csdl.tamu.edu/~marshall/PIM%20Chapter-Marshall.pdf>.

[Marshall et al., 2007] C.C. Marshall, F. McCown, and M.L. Nelson, "Evaluating Personal Archiving Strategies for Internet-based Information." Proceedings of Archiving 2007, Arlington, Virginia, May 21-24, 2007, Society for Imaging Science and Technology, Springfield, VA, 2007, pp. 151-156. <http://arxiv.org/abs/0704.3647>.

[Moore et al., 2007] R.L. Moore, J. D'Aoust, R.H. McDonald, and D. Minor, "Disk and Tape Storage Cost Models." Proceedings of Archiving 2007, Arlington, Virginia, May 21-24, 2007, Society for Imaging Science and Technology, Springfield, VA, 2007, pp. 29-32. <http://www.imaging.org/conferences/archiving2007/details.cfm?pass=21>.

[Muniswamy-Reddy et al., 2006] K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer, "Provenance-Aware Storage Systems." In Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, June 2006. <http://www.usenix.org/events/usenix06/tech/full_papers/muniswamy-reddy/muniswamy-reddy.pdf>.

[Orr, 2004] T. Orr, "Review of Literature: Representing Personal Histories in a Social Context," Unpublished manuscript, dated June, 2004.

[Perer et al., 2006] A. Perer, B. Shneiderman, and D. W. Oard, "Using Rhythms of Relationships to Understand Email Archives," Journal of the American Society for Information Science and Technology, 57, 14 (2006), pp. 1936-1948. <doi:10.1002/asi.20387>.

[Reichherzer and Brown, 2006] T. Reichherzer and G. Brown, "Quantifying Software Requirements for Supporting Archived Office Documents using Emulation." Proceedings of JCDL'06 (2006), pp. 86-94. <http://doi.acm.org/10.1145/1141753.1141770>.

[Ringel et al., 2003] M. Ringel, E. Cutrell, S. Dumais, E. Horvitz, "Milestones in time: The value of landmarks in retrieving information from personal stores." In Proceedings of Interact 2003, pp. 228-235 (2003).

[Rothenberg, 1995] J. Rothenberg, "Ensuring the Longevity of Digital Documents." Scientific American (Jan 95), 42-47. <http://www.clir.org/pubs/archives/ensuring.pdf>.

[Santry et al., 1999] D. J. Santry, M. J. Feeley, N. C Hutchinson, and A. C. Veitch, "Elephant: The file system that never forgets," In Workshop on Hot Topics in Operating Systems, pp. 2-7, 1999. <doi:10.1109/HOTOS.1999.798368>.

[Teevan et al., 2005] J. Teevan, S. T. Dumais, and E. Horvitz (2005), "Personalizing search via automated analysis of interests and activities," In Proceedings of SIGIR 2005, 449-456. <http://doi.acm.org/10.1145/1076034.1076111>.

[Youket and Olson, 2007] M. Youket and N. Olson, "Compact Disc Service Life Studies by the Library of Congress," In Proceedings of Archiving 2007, Arlington, Virginia, May 21-24, 2007, Society for Imaging Science and Technology, Springfield, VA, 2007, pp. 99-104.


doi:10.1045/march2008-marshall-pt2
