I'm David Rosenthal from the LOCKSS Program at the Stanford University Libraries, which last October celebrated its 15th birthday. The demands of LOCKSS and CLOCKSS mean that I won't be able to do a lot more work on the big picture of preservation in the near future. So it is time for a summing-up, trying to organize the various areas I've been looking at into a coherent view of the big picture.
How Well Are We Doing?
To understand the challenges we face in preserving the world's digital heritage, we need to start by asking "how well are we currently doing?" I noted some of the attempts to answer this question in my iPRES talk:
- In 2010 the ARL reported that the median research library received about 80K serials. Stanford's numbers support this. The Keepers Registry, across its 8 reporting repositories, reports just over 21K "preserved" and about 10.5K "in progress". Thus under 40% of the median research library's serials are at any stage of preservation.
- Luis Faria and co-authors (PDF) compare information extracted from publishers' web sites with the Keepers Registry and conclude:
We manually repeated this experiment with the more complete Keepers Registry and found that more than 50% of all journal titles and 50% of all attributions were not in the registry and should be added.
- Scott Ainsworth and his co-authors tried to estimate the probability that a publicly-visible URI was preserved, as a proxy for the question "How Much of the Web is Archived?" They generated lists of "random" URLs using several different techniques, including sending random words to search engines and random strings to the bit.ly URL shortening service. They then:
- tried to access the URL from the live Web.
- used Memento to ask the major Web archives whether they had at least one copy of that URL.
URIs from search engine sampling have about 2/3 chance of being archived [at least once] and bit.ly URIs just under 1/3.
An Optimistic Assessment
First, the assessment isn't risk-adjusted:
- As regards the scholarly literature, librarians, who are concerned with post-cancellation access rather than with preserving the record of scholarship, have directed resources to subscription rather than open-access content, and within the subscription category to the output of large rather than small publishers. Thus they have driven resources towards the content at low risk of loss, and away from the content at high risk of loss. Preserving Elsevier's content makes it look as though a huge part of the record is safe, because Elsevier publishes a huge part of the record. But Elsevier's content is not at any conceivable risk of loss, and is at very low risk of cancellation, so what have those resources achieved for future readers?
- As regards Web content, the more links to a page, the more likely the crawlers are to find it, and thus, other things such as robots.txt being equal, the more likely it is to be preserved. But equally, the less at risk of loss.
- A similar problem of risk-aversion is manifest in the idea that different formats are given different "levels of preservation". Resources are devoted to the formats that are easy to migrate. But precisely because they are easy to migrate, they are at low risk of obsolescence.
- The same effect occurs in the negotiations needed to obtain permission to preserve copyright content. Negotiating once with a large publisher gains a large amount of low-risk content, whereas negotiating once with a small publisher gains a small amount of high-risk content.
- Similarly, the web content that is preserved is the content that is easier to find and collect. Smaller, less linked web-sites are probably less likely to survive.
Third, the assessment is backward-looking:
- As regards scholarly communication, it looks only at the traditional forms, books and papers. It ignores not merely published data but also the more modern forms of communication scholars use, including workflows, source code repositories, and social media. These are mostly at much higher risk of loss than the traditional forms being preserved, because they lack well-established and robust business models, and they are much more difficult to preserve, since the legal framework is unclear and the content is much larger, much more dynamic, or in some cases both.
- As regards the Web, it looks only at the traditional, document-centric surface Web rather than including the newer, dynamic forms of Web content and the deep Web.
Fourth, the assessment is likely to suffer measurement bias:
- The measurements of the scholarly literature are based on bibliographic metadata, which is notoriously noisy. In particular, the metadata was apparently not de-duplicated, so there will be some amount of double-counting in the results.
- As regards Web content, Ainsworth et al describe various forms of bias in their paper.
Moreover, the content that needs preserving is no longer confined to well-defined channels, which makes it much harder both to measure and to collect:
- Books used to be published through well-defined channels that assigned ISBNs, but now e-books can appear anywhere on the Web.
- YouTube and other sites now contain vast amounts of video, some of which represents what in earlier times would have been movies.
- Much music now happens on YouTube (e.g. Pomplamoose)
- Scientific data is exploding in both size and diversity, and despite efforts to mandate its deposit in managed repositories much still resides on grad students' laptops.
Looking Forward
Each unit of the content we are currently not preserving will be more expensive to preserve than a similar unit of the content we are currently preserving. I don't know anyone who thinks digital preservation is likely to receive a vast increase in funding; we'll be lucky to maintain the current level. So, if we continue to use our current techniques, the long-term rate of content loss to future readers from failure to collect will be at least 50%. This will dwarf all other causes of loss.
If we are going to preserve the rest, which is more than half of the content, we need a radical re-think of the way we currently work. Even ignoring the issues above, we need to more than halve the cost per unit of content.
I'm on the advisory board of the EU's "4C" project, which aims to pull together the results of the wide range of research into the costs of digital curation and preservation into a usable form. My rule of thumb, based on my reading of the research, is that in the past ingest has taken about one-half, preservation about one-third, and access about one-sixth of the total cost. What are the prospects for costs in each of these areas going forward? How much more than halving the cost do we need?
Future Costs: Ingest
Increasingly, the newly created content that needs to be ingested needs to be ingested from the Web. As we've discussed at two successive IIPC workshops, the Web is evolving from a set of hyper-linked documents to being a distributed programming environment, from HTML to Javascript. In order to find the links much of the collected content now needs to be executed as well as simply being parsed. This is already significantly increasing the cost of Web harvesting, both because executing the content is computationally much more expensive, and because elaborate defenses are required to protect the crawler against the possibility that the content might be malign.
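To make this concrete, here is a minimal sketch of the "execute, then extract" style of harvesting this implies, using Python to drive a headless browser via Selenium. The tool choice, the fixed settling delay and the example URL are my illustrative assumptions, not a description of any particular crawler; a production harvester would also need the defenses against malign content mentioned above.

```python
# Sketch: discovering a page's links by executing its Javascript in a headless
# browser, rather than merely parsing the fetched HTML. Assumes Selenium and
# a headless Chrome/chromedriver are installed; the URL and the fixed settling
# delay are illustrative only.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def harvest_links(url, settle_seconds=5):
    opts = Options()
    opts.add_argument("--headless")        # no display needed on a crawler
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        # Crude stand-in for "let the page's scripts finish running"; real
        # crawlers need smarter readiness heuristics than a fixed sleep.
        time.sleep(settle_seconds)
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href]")
        return {a.get_attribute("href") for a in anchors}
    finally:
        driver.quit()

if __name__ == "__main__":
    for link in sorted(harvest_links("https://example.com/")):
        print(link)
```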
The days when a single generic crawler could collect pretty much everything of interest are gone; future harvesting will require more and more custom tailored crawling such as we need to collect subscription e-journals and e-books for the LOCKSS Program. This per-site custom work is expensive in staff time. The cost of ingest seems doomed to increase.
The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.
Future Costs: Preservation
The major cost of the preservation phase is storage, and historically the cost of storing the collected content for the long term has not been a significant concern for digital preservation. Kryder's Law, the exponential increase in the bit density of magnetic media such as disks and tape, stayed in force for three decades and resulted in the cost per byte of storing data halving roughly every two years. Thus, if you could afford to store a collection for the next few years, you could afford to store it forever, assuming Kryder's Law remained in force for a fourth decade. It hasn't:
- The slowing started in 2010, before the floods hit Thailand.
- Disk storage costs are now, two and a half years after the floods, more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
- If the industry projections pan out, as shown by the red lines, by 2020 disk costs will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Long-term storage can be paid for in one of three ways:
- It can be monetized, as with Google's Gmail service, which funds storing your e-mail without charging you by selling ads alongside it.
- It can be rented, as with Amazon's S3 and Glacier services, for a monthly payment per Terabyte.
- It can be endowed, deposited together with a capital sum thought to be enough, with the interest it earns, to pay for storage "forever".
Earlier in 2010 I started predicting, for interconnected technological and business reasons, that Kryder's Law would slow down. I expressed skepticism about Princeton's model in a talk at the 2011 Personal Digital Archiving conference, and started work building an economic model of long-term storage.
A month before the Thai floods I presented initial results at the Library of Congress. A month after the floods I was able to model the effect of price spikes, and demonstrate that the endowment needed for a data collection depended on the Kryder rate in a worrying way. At the Kryder rates we were used to, with cost per byte dropping rapidly, the endowment needed was small and not very sensitive to the exact rate. As the Kryder rate decreased, the endowment needed rose rapidly and became very sensitive to the exact rate.
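The shape of that result can be seen with a few lines of arithmetic. The sketch below computes a crude endowment as the discounted cost of re-buying media every few years over a century; the interest rate, media life, horizon and starting cost are all illustrative assumptions, and the real model includes many more factors (price spikes, running costs, replacement policy, and so on).

```python
# Sketch: how the endowment needed to store data "forever" depends on the
# Kryder rate. All parameters (interest rate, media life, horizon, initial
# cost) are illustrative assumptions, not the real model's calibration.

def endowment(initial_cost, kryder_rate, interest_rate=0.03,
              media_life_years=4, horizon_years=100):
    """Present value of all media purchases over the horizon.

    Media are re-bought every media_life_years; each purchase costs less
    than the last because cost/byte falls by kryder_rate per year.
    """
    total = 0.0
    for year in range(0, horizon_years, media_life_years):
        cost = initial_cost * (1.0 - kryder_rate) ** year
        total += cost / (1.0 + interest_rate) ** year
    return total

if __name__ == "__main__":
    for rate in (0.40, 0.30, 0.20, 0.10, 0.05, 0.0):
        print(f"Kryder rate {rate:4.0%}: endowment = "
              f"{endowment(initial_cost=100.0, kryder_rate=rate):7.1f}")
```

With these assumptions, at historic Kryder rates the endowment is little more than the first purchase and barely moves as the rate changes; as the rate falls towards zero it grows several-fold and becomes very sensitive to the exact rate, which is the worrying behaviour described above.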
Since the floods, the difficulty and cost of the next generation of disk technology and the consolidation of the disk drive industry have combined to make it clear that future Kryder rates will be much lower than they were in the past. Thus storage costs will be much higher than they were expected to be, and much less predictable.
You may think I'm a pessimist about storage costs. So let's look at what the industry analysts are saying:
- According to IDC, the demand for storage each year grows about 60%.
- According to IHS iSuppli, the bit density on the platters of disk drives will grow no more than 20%/year for the next 5 years.
- According to computereconomics.com, IT budgets in recent years have grown between 0%/year and 2%/year.
Recent industry analysts' projections of the Kryder rate have proved to be consistently optimistic. My industry contacts have recently suggested that 12% may be the best we can expect. My guess is that if your collection grows more than 10%/yr, storage cost as a proportion of the total budget will increase.
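Putting those numbers together gives a sense of the squeeze. The sketch below is back-of-the-envelope arithmetic, not a cost model: it just compounds a collection growth rate against a Kryder rate and a budget growth rate.

```python
# If the collection grows by `growth` per year while cost/byte falls by
# `kryder` per year, annual storage spend changes by (1 + growth)/(1 + kryder)
# each year. Compare that with a budget growing by `budget` per year.
def storage_share_multiplier(growth, kryder, budget, years=10):
    spend = ((1.0 + growth) / (1.0 + kryder)) ** years
    budget_level = (1.0 + budget) ** years
    return spend / budget_level   # >1 means storage eats a growing share

# IDC-style demand growth vs iSuppli-style density growth and near-flat budgets:
print(f"{storage_share_multiplier(growth=0.60, kryder=0.20, budget=0.02):.1f}x")  # ~14.6x
# My rule of thumb: ~10%/yr growth at a 12% Kryder rate roughly holds its share:
print(f"{storage_share_multiplier(growth=0.10, kryder=0.12, budget=0.00):.2f}x")  # ~0.84x
```

Growth anywhere near the analysts' demand numbers on anything like a flat budget is clearly unaffordable.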
Higher costs will lead to a search for economies of scale. In Cliff Lynch's summary of the ANADP I meeting, he pointed out:
When resources are very scarce, there's a great tendency to centralize, to standardize, to eliminate redundancy in the name of cost effectiveness. This can be very dangerous; it can produce systems that are very brittle and vulnerable, and that are subject to catastrophic failure.
But monoculture is not the only problem. As I pointed out at the Preservation at Scale workshop, the economies of scale are often misleading. Typically they follow an S-curve, and the steep part of the curve is at a fairly moderate scale. The bulk of the economies end up with commercial suppliers operating well above that scale rather than with their customers, and these vendors have large marketing budgets with which to obscure that fact. Thus "the cloud" is not an answer to reducing storage costs for long-term preservation.
The actions of the Harper government in Canada demonstrate clearly why redundancy and diversity in storage are essential, not just at the technological but also at the organizational level. Content is at considerable risk if all its copies are under the control of a single institution, particularly when, as these days, that institution is a government vulnerable to capture by a radical ideology.
Cliff also highlighted another area in which creating this kind of monoculture causes a serious problem:
I'm very worried that as we build up very visible instances of digital cultural heritage that these collections are going to become subject [to] attack in the same way the national libraries, museums, ... have been subject to deliberate attack and destruction throughout history.
...
Imagine the impact of having a major repository ... raided and having a Wikileaks type of dump of all of the embargoed collections in it. ... Or imagine the deliberate and systematic modification or corruption of materials.
Edward Snowden's revelations have shown the attack capabilities that nation-state actors had a few years ago. How sure are you that no nation-state actor is a threat to your collections? A few years hence, many of these capabilities will be available in the exploit market for all to use. Right now, advanced persistent threat technology only somewhat less effective than that which recently compromised Stanford's network is available in point-and-click form. Protecting against these very credible threats will increase storage costs further.
Every few months there is another press release announcing that some new, quasi-immortal medium such as stone DVDs has solved the problem of long-term storage. But the problem stays resolutely unsolved. Why is this? Very long-lived media are inherently more expensive, and are a niche market, so they lack economies of scale. Seagate could easily make disks with archival life, but they did a study of the market for them, and discovered that no-one would pay the relatively small additional cost.
The fundamental problem is that long-lived media only make sense at very low Kryder rates. Even if the rate is only 10%/yr, after 10 years you could store the same data in about 1/3 of the space. Since space in the data center, or even at Iron Mountain, isn't free, this is a powerful incentive to move old media out. And if you believe that Kryder rates will get back to 30%/yr, after a decade you could store nearly 14 times as much data in the same space.
There is one long-term storage medium that might eventually make sense. DNA is very dense, very stable in a shirtsleeve environment, and best of all it is very easy to make Lots Of Copies to Keep Stuff Safe. DNA sequencing and synthesis are improving at far faster rates than magnetic or solid state storage. Right now the costs are far too high, but if the improvement continues DNA might eventually solve the archive problem. But access will always be slow enough that the data would have to be really cold before being committed to DNA.
The reason that the idea of long-lived media is so attractive is that it suggests that you can be lazy and design a system that ignores the possibility of failures. You can't:
- Media failures are only one of many, many threats to stored data, but they are the only one long-lived media address.
- Long media life does not imply that the media are more reliable, only that their reliability decreases with time more slowly. As we shall see, current media are many orders of magnitude too unreliable for the task ahead.
Even if more reliable media were available, the economics of a large storage farm mean they would be worth very little extra, as one operator of such a farm explains:
Double the reliability is only worth 1/10th of 1 percent cost increase. ... Replacing one drive takes about 15 minutes of work. If we have 30,000 drives and 2 percent fail, it takes 150 hours to replace those. In other words, one employee for one month of 8 hour days. Getting the failure rate down to 1 percent means you save 2 weeks of employee salary - maybe $5,000 total? The 30,000 drives costs you $4m.
The $5k/$4m means the Hitachis are worth 1/10th of 1 per cent higher cost to us. ACTUALLY we pay even more than that for them, but not more than a few dollars per drive (maybe 2 or 3 percent more).
Moral of the story: design for failure and buy the cheapest components you can. :-)
Future Costs: Access
It has always been assumed that the vast majority of archival content is rarely accessed. Research at UC Santa Cruz showed that the majority of accesses to archived data are for indexing and integrity checks. This is supported by the relatively small proportion of total costs that access accounts for in the preservation cost studies.
But this is a backwards-looking assessment. Increasingly, as collections grow and data-mining tools become widely available, scholars want not to read individual documents, but to ask questions of the collection as a whole. Providing the compute power and I/O bandwidth to permit data-mining of collections is much more expensive than simply providing occasional sparse read access. Some idea of the increase in cost can be gained by comparing Amazon's S3, designed for data-mining type access patterns, with Amazon's Glacier, designed for traditional archival access. S3 is currently at least 5.5 times as expensive.
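As a rough check on that ratio (the per-GB prices below are assumptions roughly contemporary with this post, and real bills also include request and retrieval charges):

```python
# Storage-only comparison between S3 (supports data-mining access) and
# Glacier (occasional retrieval only). Prices are illustrative assumptions.
S3_USD_PER_GB_MONTH = 0.055       # lowest standard S3 tier at the time
GLACIER_USD_PER_GB_MONTH = 0.010

def monthly_cost(terabytes, usd_per_gb_month):
    return terabytes * 1000 * usd_per_gb_month

tb = 100
s3 = monthly_cost(tb, S3_USD_PER_GB_MONTH)
glacier = monthly_cost(tb, GLACIER_USD_PER_GB_MONTH)
print(f"{tb} TB: S3 ${s3:,.0f}/mo vs Glacier ${glacier:,.0f}/mo "
      f"({s3 / glacier:.1f}x)")   # 5.5x with these prices
```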
An example of this problem is the Library of Congress' collection of the Twitter feed. Although the Library can afford the not insignificant costs of ingesting the full feed, with some help from outside companies, the most they can afford to do with it is to make two tape copies. They couldn't afford to satisfy any of the 400 requests from scholars for access to this collection that they had accumulated by this time last year.
Implications
Earlier I showed that, even if we assume that the other half of the content costs no more to preserve than the low-hanging fruit we're already preserving, we need preservation techniques at least twice as cost-effective as the ones we currently have. But since then I've shown that:
- The estimate of half is optimistic.
- The rest of the content will be much more expensive to ingest.
- The costs of storing even the content we're currently preserving have been underestimated.
- The access that scholars will require to future digital collections will be much more expensive than that they required in the past.
Reducing Costs: Ingest
Much of the discussion of digital preservation concerns metadata. For example, of the 52 criteria in Section 4 of the Trusted Repository Audit standard, ISO 16363, 29 (56%) are metadata-related. Creating and validating metadata is expensive:
- Manually creating metadata is impractical at scale.
- Extracting metadata from the content scales better, but it is still expensive since:
- Considerable per-site work is needed to extract bibliographic metadata.
- Generating format metadata is computationally expensive.
- In both cases, extracted metadata is sufficiently noisy to impair its usefulness.
- When is the metadata required? The discussions in the Preservation at Scale workshop contrasted the pipelines of Portico and the CLOCKSS Archive, which ingest much of the same content. The Portico pipeline is far more expensive because it extracts, generates and validates metadata during the ingest process. CLOCKSS, because it has no need to make content instantly available, implements all its metadata operations as background tasks, to be performed as resources are available (see the sketch after this list).
- How important is the metadata to the task of preservation? Generating metadata because it's possible, or because it looks good in voluminous reports, is all too common. Format metadata is often considered essential to preservation, but if format obsolescence isn't happening, or if it turns out that emulation rather than format migration is the preferred solution, it is a waste of resources; and if validating the formats of incoming content using error-prone tools is used to reject allegedly non-conforming content, it is counter-productive.
Bibliographic metadata is nevertheless needed for purposes other than preservation itself, including:
- Access via bibliographic (as opposed to full-text) search via, for example, OpenURL resolvers.
- Meta-preservation services such as the Keepers Registry.
- Competitive marketing.
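Here, as promised above, is a minimal sketch of the "ingest now, extract metadata later" pattern. The queue, the worker and the placeholder extractor are my illustrative assumptions; this shows the shape of the approach, not the actual CLOCKSS pipeline.

```python
# Pattern sketch: ingest content immediately; defer expensive metadata
# extraction to a background worker that runs when resources are available.
# Names and the trivial extractor are illustrative, not a real pipeline.
import queue
import threading
import time

metadata_tasks = queue.Queue()
catalog = {}   # stand-in for the archive's metadata store

def store_bytes(item_id, content_bytes):
    # stand-in for writing to preservation storage
    catalog.setdefault(item_id, {})["size"] = len(content_bytes)

def ingest(item_id, content_bytes):
    """Store the content now; queue metadata work for later."""
    store_bytes(item_id, content_bytes)     # the only time-critical step
    metadata_tasks.put(item_id)

def metadata_worker():
    while True:
        item_id = metadata_tasks.get()
        # Expensive work (format identification, bibliographic extraction)
        # happens here, off the ingest path.
        catalog[item_id]["format"] = "application/pdf"   # placeholder result
        metadata_tasks.task_done()
        time.sleep(0.01)   # crude stand-in for "only when resources allow"

threading.Thread(target=metadata_worker, daemon=True).start()
ingest("issue-2014-03", b"%PDF-1.4 ...")
metadata_tasks.join()
print(catalog)
```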
It is becoming clear that there is much important content that is too big, too dynamic, too proprietary or too DRM-ed for ingestion into an archive to be either feasible or affordable. In cases where we simply can't ingest it, preserving it in place may be the best we can do: creating a legal framework in which the owner of the dataset commits, for some consideration such as a tax advantage, to preserve their data and to allow scholars some suitable access. Of course, since the data will be under a single institution's control it will be a lot more vulnerable than we would like, but this type of arrangement is better than nothing, and not ingesting the content is certainly a lot cheaper than the alternative.
Reducing Costs: Preservation
Perfect preservation is a myth, as I have been saying for at least 7 years using "A Petabyte for a Century" as a theme. Current storage technologies are about a million times too unreliable to keep a Petabyte intact for a century; stuff is going to get lost.
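One way to see the scale of the requirement is a deliberately simplified model in which each bit independently decays with some half-life; the "Petabyte, century, 50%" target then pins down how long that half-life must be:

```python
# Simplified model: each of the 8e15 bits in a Petabyte decays independently
# with half-life T (in years). Then P(no bit flips in t years) =
# 2 ** (-(bits * t) / T), and setting this to 0.5 gives T = bits * t.
bits = 8e15                       # one Petabyte
t_years = 100.0
required_half_life = bits * t_years
age_of_universe_years = 1.38e10

print(f"Required bit half-life: {required_half_life:.1e} years, "
      f"about {required_half_life / age_of_universe_years / 1e6:.0f} "
      f"million times the age of the universe")
```

No real medium behaves this simply, but the point stands: the reliability required is so far beyond anything manufacturers specify that loss has to be planned for, not designed away.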
Consider two storage systems with the same budget over a decade, one with a loss rate of zero, the other half as expensive per byte but which loses 1% of its bytes each year. Clearly, you would say the cheaper system has an unacceptable loss rate.
However, each year the cheaper system stores twice as much and loses 1% of its accumulated content. At the end of the decade the cheaper system has preserved 1.89 times as much content at the same cost. After 30 years it has still preserved more than 1.7 times as much at the same cost.
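The arithmetic behind these numbers is simple enough to check; the sketch below assumes the 1% loss also applies to content in the year it is added, and that both systems keep adding content at the same annual rate throughout.

```python
# Same annual budget: the reliable system adds 1 unit/year and loses nothing;
# the cheap system adds 2 units/year and loses 1% of its accumulated content
# each year (including the year in which content is added).
def preserved(years, added_per_year, annual_loss):
    total = 0.0
    for _ in range(years):
        total = (total + added_per_year) * (1.0 - annual_loss)
    return total

for years in (10, 30):
    cheap = preserved(years, added_per_year=2.0, annual_loss=0.01)
    reliable = preserved(years, added_per_year=1.0, annual_loss=0.0)
    print(f"{years} years: cheap system holds {cheap / reliable:.2f}x as much")
# 10 years: 1.89x    30 years: 1.72x
```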
Adding each successive nine of reliability gets exponentially more expensive. How many nines do we really need? Is losing a small proportion of a large dataset really a problem? The canonical example of this is the Internet Archive's web collection. Ingest by crawling the Web is a lossy process. Their storage system loses a tiny fraction of its content every year. Access via the Wayback Machine is not completely reliable. Yet for US users archive.org is currently the 153rd most visited site, whereas loc.gov is the 1,231st. For UK users archive.org is currently the 137th most visited site, whereas bl.uk is the 2,752nd.
Why is this? Because the collection was always a series of samples of the Web, the losses merely add a small amount of random noise to the samples. But the samples are so huge that this noise is insignificant. This isn't something about the Internet Archive, it is something about very large collections. In the real world they always have noise; questions asked of them are always statistical in nature. The benefit of doubling the size of the sample vastly outweighs the cost of a small amount of added noise. In this case more is better.
Reducing Costs: Access
The Blue Ribbon Task Force on Sustainable Digital Preservation and Access pointed out that the only real justification for preservation is to provide access. In most cases so far the cost of an access to an individual document has been small enough that archives have not charged the reader. But access to individual documents is not the way future scholars will want to access the collections. Either transferring a copy, typically by shipping a NAS box, or providing data-mining infrastructure at the archive is so expensive that scholars must be charged for access. This in itself has costs, since access must be controlled and accounting undertaken. Further, data-mining infrastructure at the archive must have enough performance for the peak demand but will likely be lightly used most of the time, increasing the cost for individual scholars.
The real problem here is that scholars are used to having free access to library collections, but what they increasingly want to do with the collections is expensive. A charging mechanism is needed to pay for the infrastructure and, because the scholars' access is spiky, the cloud provides both suitable infrastructure and a charging mechanism.
For smaller collections, Amazon provides Free Public Datasets: Amazon stores the data at no charge to its owner, charging the scholars who access it for the computation rather than charging the owner of the data for storage.
Even for large and non-public collections it may be possible to use Amazon. Suppose that, in addition to keeping the two archive copies of the Twitter feed on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. Right now it would be costing $7,692/mo, and each month this figure would increase by $319, so a year would cost $115,272. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be borne by the Library or charged back to the researchers. If they were charged back, the 400 outstanding requests would each need to pay about $300 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost; the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach.
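The arithmetic, using the figures above and treating the monthly charge as growing linearly through the year, looks like this:

```python
# Year-one cost of an S3 Reduced Redundancy access copy of the Twitter feed,
# using the estimates above: $7,692/month initially, growing $319/month.
start_rate = 7692.0                 # $/month at the start of the year
monthly_increase = 319.0            # the feed keeps growing
end_rate = start_rate + 12 * monthly_increase
year_total = 12 * (start_rate + end_rate) / 2   # linear growth over the year
requests = 400

print(f"Year one storage: ${year_total:,.0f}")             # $115,272
print(f"Per researcher:   ${year_total / requests:,.0f}")  # about $290 each
```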
Recently Twitter offered a limited number of scholars access to its infrastructure to data-mine from the feed, but this doesn't really change the argument.
There are potential storage technologies that combine computation and storage in a cost-effective way. Colleagues at UC Santa Cruz and I proposed one such architecture, which we called DAWN (Durable Array of Wimpy Nodes) in 2011. Architectures of this kind might significantly reduce the cost of providing scholars with data-mining access to the collections. The evolution of storage media is pointing in this direction. But there are very considerable business model difficulties in the way of commercializing such technologies.
Marketing
Any way of making preservation cheaper can be spun as "doing worse preservation". Jeff Rothenberg's Future Perfect 2012 keynote is an excellent example of this spin in action.
We live in a marketplace of competing preservation solutions. A very significant part of the cost of both not-for-profit systems such as CLOCKSS or Portico, and commercial products such as Preservica is the cost of marketing and sales. For example, TRAC certification is a marketing check-off item. The cost of the process CLOCKSS is currently undergoing to obtain this check-off item will be well in excess of 10% of its annual budget.
Making the tradeoff of preserving more stuff using worse preservation would need a mutual non-aggression marketing pact. Unfortunately, the pact would be unstable. The first product to defect and sell itself as "better preservation than those other inferior systems" would win. Thus private interests work against the public interest in preserving more content.
Conclusion
Most current approaches to digital preservation aim to ensure that, once ingested, content is effectively immune from bit-rot and the mostly hypothetical threat of format obsolescence, and that future readers will be able to access individual documents via metadata. There are four major problems with these approaches:
- Reading individual documents one-at-a-time is unlikely to be the access mode future scholars require.
- Against the threats they do address, bit-rot and format obsolescence, the effectiveness of current techniques at the scale required is limited.
- Against other credible threats, such as external attack and insider abuse, the effectiveness of current techniques is doubtful.
- Current techniques are so expensive that by far the major cause of future scholars' inability to access content will be that the content was not collected in the first place.
5 comments:
Just a quick note about access to archive.org: our group recently published two papers on this topic:
Access patterns for robots and humans in web archives, JCDL 2013
Summary: In archive.org, robots outnumber humans 10:1 in terms of sessions, 5:4 in terms of raw HTTP accesses (b/c of images), and 4:1 in terms of megabytes transferred. 95% of robot accesses are page scraping the calendar interface (i.e., HTML TimeMaps), and 33% of human accesses hit just a single memento.
Who and What Links to the Internet Archive, TPDL 2013
Summary: Covers many things, but table 3 is the most important: for humans, if archive.org gives a 200 (i.e., page is archived) then there is a 64% chance the page is 404 on the live web; if archive.org gives a 404 (i.e., page not archived), then there is a 75% chance the page is 404 on the live web.
We had begun this line of study hoping to develop something that looked for patterns in the web archive accesses with the hope of correlating them with current events (or the other way around). For example, on the 1 year anniversary of X there would be sessions in IA about X. We couldn't find anything like that; if it's there it's well hidden. Basically, humans hit archive.org for individual 404 pages and *not* historical sessions.
The Lancet Global Health has a piece detailing the Harper government's attacks on science.
More on the Harper government's war on science here.
Another interesting piece of research from Old Dominion University. Lulwah M. Alkwai studied how well Arabic websites are preserved, and yet again concluded that about half of their sample of URIs was not archived (138968/300646 to be precise, or 46%).
While I'm here, Dave at 24%Majority summarizes the entire 4 year record of the Harper administration in 2011-2015 Harper Government Wrap-up.