You can tell this is an extraordinary honor from the list of previous awardees, and the fact that it is the first time it has been awarded in successive years. Part of the award is the opportunity to make an extended presentation to open the meeting. Our talk was entitled Lessons From LOCKSS, and the abstract was:
Vicky and David will look back over their two decades with the LOCKSS Program. Vicky will focus on the Program's initial goals and how they evolved as the landscape of academic communication changed. David will focus on the Program's technology, how it evolved, and how this history reveals a set of seductive, persistent but impractical ideas.

Below the fold is the text with links to the sources, information that appeared on slides but was not spoken, and much additional information in footnotes.
Introduction (Vicky)
First, we are extremely grateful to the Paul Evan Peters award committee and CNI, and to the Association of Research Libraries, EDUCAUSE, Microsoft and Xerox, who endowed the award.

David and I are honored and astonished by this award. Honored because it is the premier award in the field, and astonished because we left the field more than seven years ago to take up our new full-time career as grandparents. The performance metrics are tough, but it's a fabulous gig.
This talk will be mostly historical. David will discuss the technology's design and some lessons we learned deploying it. First I will talk about our goals when, more than a quarter-century ago, we walked into Michael Keller's office and pitched the idea that became the LOCKSS Program. Mike gave us three instructions:
- Don't cost me any money.
- Don't get me into trouble.
- Do what you want.
| Support | Ideas | Technology |
|---|---|---|
| Michael Lesk | Karen Hunter (CLOCKSS) | Petros Maniatis |
| Don Waters | James Jacobs (GovDocs) | TJ Giuli |
| Michael Keller | Martin Halbert & Katherine Skinner (1st PLN) | Mema Roussopoulos |
| Brewster Kahle | Clifford Lynch | Mary Baker |
| Jim Mitchell | Jefferson Bailey | Mark Seiden |
| John Sack | Gordon Tibbits | LOCKSS team |
| Susan Horsfall | | |
We succeeded in each of these. That the program is still going is thanks to many people. Snowden Becker who is here today represents a much larger team who work tirelessly to sustain the program. Many others helped along the way. Michael Lesk then at NSF and Donald Waters then at the Mellon Foundation provided essential funding. This slide attempts to thank everyone, but we're sure we've left people out — it was a long time ago.
Background (Vicky)
Let's get started. Over the centuries libraries developed a dual role. By building collections they provided current readers with access to information. Then they exercised stewardship over these collections to safeguard future readers' access.Libraries transitioned from the print to the digital world over a couple of decades. In the mid 1980’s the Library of Congress experimented with readers accessing journals on 12-inch optical media.
| Year | Month | Event |
|---|---|---|
| 1984 | | LoC Machine Reading Room |
| 1989 | Nov | First Web page |
| 1991 | Dec | First US Web page |
| 1993 | | Cliff Lynch OTA report |
| 1994 | Sep | Stanford Digital Library Project |
| 1995 | Jan | Rothenberg's SciAm article |
| 1995 | May | Highwire Press |
| 1996 | May | Internet Archive |
| 1998 | Sep | |
| 1998 | Oct | LOCKSS |
| 1998 | | Tim Berners-Lee "Cool URIs" |
Two years later the Stanford Linear Accelerator Center put the first US Web page online and people started thinking about how this new publishing medium would impact the academy. An early effort came in 1993 when Cliff Lynch wrote a 105-page report for the federal Office of Technology Assessment.
Now, consider a library acquiring information in an electronic format. Such information is almost never, today, sold to a library (under the doctrine of first sale); rather, it is licensed to the library that acquires it, with the terms under which the acquiring library can utilize the information defined by a contract typically far more restrictive than copyright law. The licensing contract typically includes statements that define the user community permitted to utilize the electronic information as well as terms that define the specific uses that this user community may make of the licensed electronic information. These terms typically do not reflect any consideration of public policy decisions such as fair use, and in fact the licensing organization may well be liable for what its patrons do with the licensed information.
Accessibility and Integrity of Networked Information Collections, Page 30 (our emphasis)

Cliff's report was wide-ranging and insightful. In particular, he noted the change from the "first sale" doctrine legal framework to a publisher- and library-specific contract written by the publisher's lawyers.
Very few contracts with publishers today are perpetual licenses; rather, they are licenses for a fixed period of time, with terms subject to renegotiation when that time period expires. Libraries typically have no controls on price increase when the license is renewed; thus, rather than considering a traditional collection development decision about whether to renew a given subscription in light of recent price increases, they face the decision as to whether to lose all existing material that is part of the subscription as well as future material if they choose not to commit funds to cover the publisher's price increase at renewal time.
Accessibility and Integrity of Networked Information Collections, Page 31 (our emphasis)

He pointed out that the change made future readers' access completely dependent upon continued payment and the publishers' whims, thus blocking libraries from fulfilling their critical stewardship role.
In 1995 I was part of the small team that developed Stanford's Highwire Press. Highwire was the first Web publishing platform for academic journals. By then the problems Cliff identified impacting libraries' stewardship role had become obvious. At the time I attended a lot of conferences. A frequent discussion topic was the ramifications of libraries transitioning from content ownership to content access. Many highly placed librarians thought the change was great – no more building collections, no more stewardship responsibility! I strongly disagreed. Hiking with David after one such conference I described how stewardship worked in the paper world and how it didn't in the Web world. His response was "I can build a system that works the way paper does".
David’s and my goal was to model the way paper worked, to provide librarians with an easy, familiar, affordable way to build and steward traditional collections that were migrating from paper to online.
Libraries fulfill their stewardship role when future access is ensured. Stewardship occurs when libraries take possession of and manage cultural and intellectual assets. We thought it vital for libraries to retain their stewardship role in the scholarly communication ecosystem. We didn't want them to become simply convenient places to work and drink coffee[1].
Stewardship matters for at least three reasons:
- To protect privacy.
- To protect first sale.
- To defend against censorship.

Stewardship protects privacy when librarians fight for their patrons' rights.
The American Library Association's Library Bill of Rights
All people have a right to privacy. Librarians should safeguard the privacy of all library use.
VII. All people, regardless of origin, age, background, or views, possess a right to privacy and confidentiality in their library use. Libraries should advocate for, educate about, and protect people’s privacy, safeguarding all library use data, including personally identifiable information.
Adopted June 19, 1939, by the ALA Council; amended October 14, 1944; June 18, 1948; February 2, 1961; June 27, 1967; January 23, 1980; January 29, 2019.
Inclusion of “age” reaffirmed January 23, 1996.
Stewardship protects ownership transfer when content is acquired.
The First Sale doctrine is pivotal. It enables the business of libraries. It enables libraries to maintain and circulate knowledge. First Sale ensures that the public, especially future generations, benefit from today's and yesterday's works of literature, science, and culture.
Stewardship resists censorship when there are multiple copies under multiple stewards.
Today, book banning is on the rise. Librarians are being forced to remove items from circulation. Content ownership ensures materials can’t be erased from view without detection. Stewardship of banned materials allows librarians to choose whether to safeguard these materials for future readers.
GovDocs Preservation
Government documents are, and always have been, in the crosshairs of censors. I'll mention four efforts providing countervailing forces:
- U.S. Docs LOCKSS Network: James R. Jacobs, Stanford University
- Canadian Government Information Digital Preservation Network: Carla Graebner, Simon Fraser University
- End of Term Web Archive Partners: Jefferson Bailey, Internet Archive
- Data Rescue Project: datarescueproject@protonmail.com
- First, the U.S. Docs LOCKSS Network. The Government Publishing Office (GPO) produces and distributes government documents. In the paper world, the Federal Depository Library Program distributed documents to over 1,000 libraries across the nation. To recall documents, the government had to contact the librarians and ask them to withdraw the materials. It was a transparent process.
This is a sample of withdrawn Federal documents.
Online, there were no censorship guardrails. In 2008 a small group of librarians formed the U.S. Docs LOCKSS Network. This program is a digital instantiation of the U.S. Federal Depository Library Program. In partnership with the Government Publishing Office, participating libraries have recreated the distributed, transparent, censor-resistant nature of the depository paper system.
This is a sample of volumes released this February to the U.S. Docs LOCKSS Network.
- Second, the Canadian Government Information Digital Preservation Network. It consists of 11 academic libraries that use Archive-It (an Internet Archive service) to collect all Canadian federal documents. The collected documents are then moved from the Internet Archive into a local LOCKSS network for distributed safekeeping.
- Third, the End of Term Web Archive. This partnership captures U.S. Government websites at the end of presidential administrations. With this last change of administration, thousands of federal web pages and datasets have been taken offline. Federal web sites hold information important to every corner of a university. The End of Term Archive is an extraordinarily important resource. Oddly, only two universities partner with Archive-It to do this work: Stanford and the University of North Texas. The End of Term Web Archive partners are:
- Common Crawl Foundation
- Environmental Data & Governance Initiative
- Internet Archive
- Stanford University Libraries
- University of North Texas Libraries
- Webrecorder
- Last, there are many efforts to capture US data sets. The Data Rescue Project serves as a clearing house.
The community recently relearned a lesson history failed to teach: digital preservation's biggest threat is insider attack. In recent months an unknown number of critical government databases have been deleted or altered. The antidote to insider attack is multiple copies under multiple stewards. In LOCKSS language, let's make it easy to find some of the copies, but hard to find all the copies.
Funding programs like LOCKSS is difficult. The LOCKSS Program reinstates stewardship and enables libraries as memory organizations. This is a hard sell. Librarians spend scarce resources to support current readers; spending them to ensure materials are available to tomorrow's readers ... not so much. While fundraising fluctuates, costs are steady. To ensure stability, we accumulated reserves by keeping a very lean staff and being stingy with salaries.
And then along came CLOCKSS, where publishers took the lead to establish a community-run archive that implements library values. In 2006 a handful of publishers, notably the late Karen Hunter of Elsevier, suggested a partnership between libraries and publishers to form a community-run archive. In 2008, after a pilot funded by the founding archive libraries, contributing publishers, and the Library of Congress' NDIIPP, the CLOCKSS archive went into production.
Identical copies of archived content are held in eleven libraries worldwide (Scotland, Australia, Japan, Germany, Canada, and six in the United States). This international footprint ensures content is safe from shifting ideologies, or nefarious players. As in all LOCKSS networks, if a bad actor tries to remove or change content, the technology warns humans to investigate.
The CLOCKSS founding librarians and publishers unanimously agreed that when archived content becomes unavailable, it will be hosted from multiple sources, open access. An example: Heterocycles was an important chemistry journal. Established in 1973, it abruptly ceased publication in 2023 after 50 years. Inexplicably the journal also disappeared from the publisher’s web site; current subscribers lost all access. The content was unavailable from anywhere.
Fortunately, the entire run of the Heterocycles journal was archived in CLOCKSS. In June 2024, two CLOCKSS archive libraries, the University of Edinburgh and Stanford University each made all 106 volumes open access on the web.
The CLOCKSS Archive is governed equally by publishers and librarians, in true community spirit. However, publishers provide the bulk of the financial support, contributing 70% of incoming funds; libraries contribute only 30%. Alicia Wise, the CLOCKSS executive director, reports this gap is widening over time. Ironically, the publishers many librarians consider "rapacious" are paying for an archive that upholds traditional library values and protects content access for future readers.
After more than a quarter-century, the LOCKSS Program continues to collect, to preserve and to provide access to many genres of content. The business model has evolved, but the goals have persisted. I will now hand over to David to talk about the technology, which has also evolved and persisted.
Technology Design (David)
The ARL Serials Initiative forms part of a special campaign mounted by librarians in the 1980s against the high cost of serials subscriptions. This is not the first time that libraries have suffered from high serial prices. For example, in 1927 the Association of American Universities reported that:

"Librarians are suffering because of the increasing volume of publications and rapidly rising prices. Of special concern is the much larger number of periodicals that are available and that members of the faculty consider essential to the successful conduct of their work. Many instances were found in which science departments were obligated to use all of their allotment for library purposes to purchase their periodical literature which was regarded as necessary for the work of the department."

From Report of the ARL Serials Prices Project, 1989

The power imbalance between publishers and their customers is of long standing, and it especially affects the academic literature.[2] Simplistic application of Web technology drove a change from purchasing a copy of the content to renting access to the publisher's copy.[3] This greatly amplifies the preexisting power imbalance. Thus in designing the LOCKSS system, we faced three challenges:
- to model for the Web the way libraries worked on paper,
- to somehow do so within the constraints of contract law and copyright,
- to ensure the system cost was negligible compared to subscription costs.
LOCKSS Design Goals
The system I envisaged on the hike would consist of a LOCKSS box at each library, the digital analog of the stacks, that would hold the content the library had purchased. It would need these characteristics of the paper system:
Allow libraries to:
- Collect journals to which they subscribed
- Give current readers access to their collection
- Preserve their collection for future readers
- Cooperate with other libraries
- It would allow libraries to collect material to which they subscribed from the Web.
- It would allow libraries' readers to access material they had collected.
- It would allow them to preserve their collections against the multiple frailties of digital information.
- It would allow libraries to cooperate, the analog of inter-library loan and copy.
Collect
The collect part was both conceptually simple and mostly off-the-shelf. Since the journals were pay-walled, as with paper each library had to collect its own subscription content. But collecting content is what Web browsers do. When they fetch content from a URL they don't just display it, they store it in a cache on local storage. They can re-display it without re-fetching it. The system needed a browser-like "Hotel California" cache that never got flushed, and a Web crawler like those of search engines, so that all the library's subscribed content ended up in the cache.

Because we lacked "first sale" rights, the crawler had to operate with permission from the publisher, which took the form of a statement on their Web site. No permission, no collection.
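In outline, the collect path is just a permission-gated crawl into a cache that is never flushed. Here is a minimal sketch; the permission statement text, the cache layout, and the single-site scope rule are illustrative assumptions, not the production LOCKSS crawler:

```python
# Sketch of a permission-gated crawl into a "Hotel California" cache.
# The permission text and cache layout here are illustrative assumptions.
import hashlib
import pathlib
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

PERMISSION_STATEMENT = "permission to collect, preserve, and serve"
CACHE_ROOT = pathlib.Path("cache")   # content checks in, it never checks out
CACHE_ROOT.mkdir(exist_ok=True)

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def cache_path(url: str) -> pathlib.Path:
    return CACHE_ROOT / hashlib.sha256(url.encode()).hexdigest()

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def crawl(start_url: str, permission_url: str, limit: int = 100) -> None:
    # No permission statement on the publisher's site, no collection.
    if PERMISSION_STATEMENT not in fetch(permission_url).decode(errors="replace"):
        print("No permission statement found; not collecting.")
        return
    queue, seen = [start_url], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        body = fetch(url)
        cache_path(url).write_bytes(body)        # store; never flush
        parser = LinkExtractor()
        parser.feed(body.decode(errors="replace"))
        # Stay within the subscribed site, like a subscription crawl.
        queue.extend(u for u in (urljoin(url, l) for l in parser.links)
                     if u.startswith(start_url))
```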
Access
The access part was also conceptually simple and mostly off-the-shelf. Readers should see the content from the publisher unless it wasn't available. Their LOCKSS box should act as a transparent Web proxy, forwarding requests to the publisher and, if the response were negative, responding with the cached copy.
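A minimal sketch of that fallback logic, with a plain dictionary standing in for the on-disk cache (the real proxy is considerably more involved):

```python
# Sketch of the access path: try the publisher first, fall back to the
# preserved copy when the publisher's response is negative or absent.
import urllib.error
import urllib.request

def serve(url: str, cache: dict) -> bytes:
    """Return the publisher's copy if available, else the preserved copy."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if resp.status < 400:
                return resp.read()
    except (urllib.error.URLError, OSError):
        pass  # publisher unreachable or negative: fall through to the cache
    return cache[url]  # the copy collected while the subscription was active
```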
Preserve

The preserve part was conceptually simple — just don't remove old content from the cache on disk. But it was more of a problem to implement, for three reasons:
- Disks are not 100% reliable in the short term and are 0% reliable over library timescales. Over time, content in the cache would get corrupted or lost.
- Because libraries were under budget pressure and short of IT resources, the hardware of the LOCKSS box had to be cheap, and thus not especially reliable.
- Content in the cache would be seen by humans only in exceptional circumstances, so detecting corruption or loss could not depend upon humans.
Cooperate
Cooperation provided the solution to the problems of preservation. We expected considerable overlap between libraries' subscriptions. Thus each journal would be collected by many libraries, just as in the paper system. For each journal, the LOCKSS boxes at the subscribing libraries could compare their versions, voting in a version of the standard Byzantine Fault Tolerance algorithm. A library that lost a vote could repair its damaged copy from another library.

The goal of stewardship drove LOCKSS' approach to preservation; given a limited budget and a realistic range of threats, data survives better in many cheap, unreliable, loosely-coupled replicas than in a single expensive, durable one.
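The polling idea can be sketched in a few lines: each box hashes its copy, the majority hash wins, and a box that disagrees repairs from a peer that agrees. This toy version omits everything that makes the real protocol robust (per-poll nonces, sampled electorates, rate limiting), so treat it as an illustration of the intent, not the algorithm:

```python
# Toy sketch of poll-and-repair: hash each replica, take the majority hash,
# and repair any replica that disagrees from a peer that agrees.
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def poll_and_repair(copies: dict) -> dict:
    """copies maps a library name to its stored copy of one archival unit."""
    votes = Counter(digest(c) for c in copies.values())
    winner, _ = votes.most_common(1)[0]                  # consensus hash
    good = next(c for c in copies.values() if digest(c) == winner)
    repaired = {}
    for library, copy in copies.items():
        if digest(copy) != winner:
            print(f"{library} lost the poll; repairing from a peer")
            copy = good                                   # fetch agreed content
        repaired[library] = copy
    return repaired

# Example: one damaged replica gets repaired from the agreeing majority.
boxes = {"lib-a": b"<html>vol 1</html>",
         "lib-b": b"<html>vol 1</html>",
         "lib-c": b"CORRUPTED"}
boxes = poll_and_repair(boxes)
```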
Technology Lessons (David)
Our initial vision for the system was reasonably simple, but "no plan survives contact with the enemy" and so it was as we developed the system and deployed it in production. Now for some lessons from this process that are broadly applicable.

Format Migration
In January 1995 the idea that the long-term survival of digital information was a significant problem was popularized by Jeff Rothenberg's Scientific American article Ensuring the Longevity of Digital Documents. Rothenberg's concept of a "digital document" was of things like Microsoft Word files on a CD, individual objects encoded in a format private to a particular application. His concern was that the rapid evolution of these applications would eventually make it impossible to access the content of objects in that format. He was concerned with interpreting the bits; he essentially assumed that the bits would survive.

But thirty years ago next month an event signaled that Rothenberg's concerns had been overtaken by events. Stanford pioneered the transition of academic publishing from paper to the web when Highwire Press put the Journal of Biological Chemistry on the Web. Going forward, the important information would be encoded in Web formats such as HTML and PDF. Because each format with which Rothenberg was concerned was defined by a single application, it could evolve quickly. But Web formats were open standards, implemented in multiple applications. In effect they were network protocols, and thus evolve at a glacial pace.[4]
The rapid evolution of Rothenberg's "digital documents" had effectively stopped, because they were no longer being created and distributed in that way. Going forward, there would be a static legacy set of documents in these formats. Libraries and archives would need tools for managing those they acquired, and eventually emulation, the technique Rothenberg favored, would provide them. But by then it turned out that, unless information was on the Web, almost no-one cared about it.
Thus the problem for digital preservation was the survival of the bits, aggravated by the vast scale of the content to be preserved. In May of the following year, 2004 Paul Evan Peters awardee Brewster Kahle established the Internet Archive to address the evanescence of Web pages.[5] This was the first digital preservation effort to face the problems of scale - next year the archive will have collected a trillion Web pages.[6]
The LOCKSS system, like the Wayback Machine, was a system for ensuring the survival of, and access to, the bits of Web pages in their original format. This was a problem; the conventional wisdom in the digital preservation community was that the sine qua non of digital preservation was defending against format obsolescence. Neither Kahle nor we saw any return on investing in format metadata or format migration. We both saw scaling up to capture more than a tiny fraction of the at-risk content as the goal. Events showed we were right, but at the time the digital preservation community viewed LOCKSS with great skepticism, as "not real digital preservation".
The LOCKSS team repeatedly made the case that preserving Web content was a different problem from preserving Rothenberg's digital documents, and thus that applying the entire apparatus of "preservation metadata", PREMIS, FITS, JHOVE, and format normalization to Web content was an ineffective waste of scarce resources. Despite this, the drumbeat that LOCKSS wasn't "real digital preservation" continued.
After six years, the LOCKSS team lost patience and devoted the necessary effort to implement a capability they were sure would never be used in practice. The team implemented, demonstrated and in 2005 published transparent, on-demand format migration of Web content preserved in the LOCKSS network. This was possible because the specification of the HTTP protocol that underlies the Web supports the format metadata needed to render Web content. If it lacked such metadata, Web browsers wouldn't be possible. The criticism continued unabated.[7]
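The idea can be sketched as a lookup keyed on the stored Content-Type: if the requesting client accepts the preserved format, serve the original bytes; if not, convert at access time and relabel the response. The published demonstration migrated image formats; to stay dependency-free this sketch migrates a legacy character encoding instead, and the converter registry is an illustrative assumption, not the 2005 implementation:

```python
# Sketch of transparent, on-demand format migration driven by the HTTP format
# metadata (Content-Type) stored with each preserved response.
def latin1_to_utf8(body: bytes) -> bytes:
    # A trivial "migration": re-encode legacy Latin-1 text as UTF-8.
    return body.decode("latin-1").encode("utf-8")

# Map preserved MIME type -> (target MIME type, converter). Illustrative only.
CONVERTERS = {
    "text/plain; charset=iso-8859-1": ("text/plain; charset=utf-8", latin1_to_utf8),
}

def serve_preserved(body: bytes, content_type: str, accept: str):
    """Serve the original bytes if the client accepts the preserved format,
    otherwise migrate at access time and relabel the Content-Type."""
    if content_type in accept or "*/*" in accept:
        return body, content_type                 # original format still renders
    if content_type in CONVERTERS:
        target, convert = CONVERTERS[content_type]
        return convert(body), target              # on-demand format migration
    return body, content_type                     # no converter: serve as-is

# Example: a client that only accepts UTF-8 text gets a migrated copy.
page, ctype = serve_preserved(b"caf\xe9",
                              "text/plain; charset=iso-8859-1",
                              "text/plain; charset=utf-8")
```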
There have been a number of services based instead upon emulation, the technique Rothenberg preferred. Importantly, Ilya Kreymer's oldweb.today uses emulation to show preserved Web content as it appeared in a contemporaneous browser, not as it appears in a modern browser.
Dynamic Content
Around 6th December 1991 Paul Kunz at the Stanford Linear Accelerator Center brought up the first US Web site.[8] In a foreshadowing of future problems, its content was dynamic. It was a front-end for querying databases; although the page itself was static, clicking on the links potentially returned different content as the underlying database was edited.
Digital documents in a distributed environment may not behave consistently; because they are presented both to people who want to view them and software systems that want to index them by computer programs, they can be changed, perhaps radically, for each presentation. Each presentation can be tailored for a specific recipient.
When documents deceive: Trust and provenance as new factors for information retrieval in a tangled web

Cliff Lynch identified the problem that dynamic content posed for preservation. In 2001 he wrote "Each presentation can be tailored for a specific recipient". Which recipient's presentation deserves to be preserved? Can we show a future recipient what they would have seen had they accessed the resource in the past?
there’s a largely unaddressed crisis developing as the dominant archival paradigms that have, up to now, dominated stewardship in the digital world become increasingly inadequate. ... the existing models and conceptual frameworks of preserving some kind of “canonical” digital artifacts ... are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances.
Stewardship in the "Age of Algorithms"

Sixteen years later Lynch was still flagging the problem.
The dynamic nature of Web content proved irresistible to academic journal publishers, despite their content being intended as archival. They added features like citation and download counts, personalizations, and of course advertisements to HTML pages, and watermarked their PDFs. These were all significant problems for LOCKSS, which depended upon comparing the copies ingested by multiple LOCKSS boxes. The comparison process had to filter out the dynamic content elements; maintaining the accuracy of doing so was a continual task.[9]
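The filtering step amounts to canonicalizing each page before hashing it, so that two boxes holding the same article agree despite different hit counts or session tokens. The patterns below are illustrative; the real system carries per-publisher filter rules in its plugins:

```python
# Sketch of hashing after filtering out dynamic page elements, so that two
# boxes holding the same article produce the same digest. The regexes are
# illustrative stand-ins for per-publisher filter rules.
import hashlib
import re

FILTERS = [
    re.compile(rb"<script.*?</script>", re.S),   # analytics, personalization
    re.compile(rb"<!-- downloads: \d+ -->"),      # hypothetical hit counter
    re.compile(rb"sessionid=[0-9a-f]+"),          # per-reader session tokens
]

def filtered_digest(page: bytes) -> str:
    for pattern in FILTERS:
        page = pattern.sub(b"", page)
    return hashlib.sha256(page).hexdigest()

page_a = b"<html><!-- downloads: 41 --><p>The article text.</p></html>"
page_b = b"<html><!-- downloads: 57 --><p>The article text.</p></html>"
assert filtered_digest(page_a) == filtered_digest(page_b)  # counter filtered out
```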
The fundamental problem is that the Web does not support Uniform Resource Names (URNs), only Uniform Resource Locators (URLs). A URN would specify what a resource consists of; all that a URL specifies is where a resource can be obtained from. As with the first US Web page, what content you obtain from a URL is unspecified and can be different, or even unobtainable, on every visit.
The reason the Web runs on URLs not URNs is that the underlying Internet's addresses, both IP and DNS, only specify location. There have been attempts to implement a network infrastructure that would support "what" not "where" addresses; think of it as BitTorrent, but at the transport not the content layer.[10]
The goal of digital preservation is to create one or more persistent, accessible replicas of the content to be preserved. In "what" networks, each copy has the same URN. In IP-based networks, each copy has a different URL; to access the replica requires knowing where it is. Thus if the original of the preserved content goes away, links to it no longer resolve.
Starting in 2010, Herbert Van de Sompel, the 2017 Paul Evan Peters awardee, and others made a valiant effort to solve this problem with Memento. Accepting the fact that persistent replicas of content at a URL at different times in the past would have different URLs, they provided an HTTP-based mechanism for discovering the URL of the replica closest to a desired time. In some cases, such as Wikis, the original Web site implements the discovery mechanism and the underlying timeline. In other cases, such as the Wayback Machine, the site holding the replica implements the timeline. Since there are likely to be multiple Web archives with replicas of a given URL, Memento in practice depends upon Aggregator services to provide a unified timeline of the replica space.
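For example, a Memento datetime negotiation is just an HTTP request with an Accept-Datetime header sent to a TimeGate, which redirects to the replica closest to the requested time. This sketch assumes the public Memento aggregator's TimeGate endpoint; that URL pattern and its availability are assumptions, not part of the protocol itself:

```python
# Sketch of Memento (RFC 7089) datetime negotiation: ask a TimeGate for the
# replica of a URL closest to a desired time.
import urllib.request

def find_memento(url: str, when: str) -> str:
    """when is an HTTP-date, e.g. 'Thu, 31 May 2007 20:35:00 GMT'."""
    # Assumed public aggregator TimeGate endpoint.
    timegate = "http://timetravel.mementoweb.org/timegate/" + url
    req = urllib.request.Request(timegate, headers={"Accept-Datetime": when})
    with urllib.request.urlopen(req) as resp:
        # After redirects, resp.url is the memento chosen by the aggregator;
        # its Memento-Datetime header says when that replica was captured.
        print(resp.headers.get("Memento-Datetime"))
        return resp.url

# find_memento("http://www.stanford.edu/", "Thu, 31 May 2007 20:35:00 GMT")
```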
In "what" networks there would still be a need to provide an aggregated timeline, not discovering the URL of a replica from a desired time, but discovering its URN. Just as in the Web, they would depend upon a mechanism above the transport layer to connect the names into a timeline. Thus, despite its theoretical appeal, "what" networking's practical advantages are less than they appear.
Decentralization
When something is published in print, legitimate copies ... are widely distributed to various organizations, such as libraries, which maintain them as public record. These copies bear a publication date, and the publisher essentially authenticates the claims of authorship ... By examining this record, control of which is widely distributed ... it is possible, even years after publication, to determine who published a given work and when it was published. It is very hard to revise the published record, since this involves all of the copies and somehow altering or destroying them.
The integrity of digital information: Mechanics and definitional issues

In 1994 Lynch had described how "Lots Of Copies Keep Stuff Safe" in the paper world. Compare this with how we summarized libraries' role in our first major paper on LOCKSS, Permanent Web Publishing:
Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other librarians need them.

Because we were modeling the paper library system, we hoped that the LOCKSS system would obtain the benefits of a decentralized system over a centralized one performing the same function, which in the paper system and in theory are significant:
- It can be more resilient to failures and attacks.
- It can resist acquisition and the consequent enshittification.
- It can scale better.
- It has the economic advantage that it is hard to compare the total system cost with the benefits it provides because the cost is diffused across many independent budgets.
Unfortunately in the digital world it is extraordinarily difficult to reap the theoretical benefits of decentralization. I laid out the reason why this is so a decade ago in Economies of Scale in Peer-to-Peer Networks. In brief, the mechanism described by W. Brian Arthur in his 1994 book Increasing Returns and Path Dependence in the Economy operates. Technology markets have very strong increasing returns to scale. The benefits from participating in a decentralized digital system increase faster than the costs, which drives centralization.[11] Thirty years later, Arthur's work explains today's Web perfectly.
In Decentralized Systems Aren't I explained that they suffer four major problems:
- Their advantages come with significant additional monetary and operational costs, such as the massive cost of the Bitcoin blockchain's Proof-of-Work.
- Their user experience is worse, being more complex, slower and less predictable. An example is that Bitcoin's transaction rate is limited by its 10-minute block time.
- They are in practice only as decentralized as the least decentralized layer in the stack.
- Their excess costs cause emergent behaviors that drive centralization.

The LOCKSS system was designed and implemented to be completely decentralized. It was permissionless; nodes could join and leave the network as they liked. We designed the network protocol to be extremely simple, both to avoid security flaws, and also in the hope that there would be multiple implementations, avoiding single points of failure. There were a number of reasons why, over time, it turned out much less decentralized than we hoped:
- Although we always paid a lot of attention to the security of LOCKSS boxes, we understood that a software mono-culture was vulnerable to software supply chain attacks. But it turned out that the things that a LOCKSS box needed to do other than handling the protocol were quite complex, so despite our best efforts we ended up with a software monoculture.
- We hoped that by using the BSD open-source license we would create a diverse community of developers, but we over-estimated the expertise and the resources of the library community, so Stanford provided the overwhelming majority of the programming effort.
- Don Waters was clear that grant funding could not provide the long-term sustainability needed for digital preservation. So he provided a matching grant to fund the transition to being funded by the system's users. This also transitioned the system to being permissioned, as a way to ensure the users paid.
- Although many small and open-access publishers were happy to allow LOCKSS to preserve their content, the oligopoly publishers never were. Eventually they funded a completely closed network of huge systems at major libraries around the world called CLOCKSS. This is merely the biggest of a number of closed, private LOCKSS networks that were established to serve specific genres of content, such as government documents.
Archival Media (David)
Don't, don't, don't, don't believe the hype!
Public Enemy
We have already warned you against three seductive but impractical ideas: format migration, "what" networking, and decentralization. My parting gift to you is to stop you wasting time on another seductive but impractical idea — that the solution to digital preservation is quasi-immortal media. What follows is an extract from a talk at Berkeley last month.
Archival Data
- Over time, data falls down the storage hierarchy.
- Data is archived when it can't earn its keep on near-line media.
- Lower cost is purchased with longer access latency.
Hype
The mainstream media occasionally comes out with an announcement like this from the Daily Mail in 2013, or this from the New Yorker last month. Note the extrapolation from "a 26 second excerpt" to "every film and TV program ever created in a teacup".

Six years later, this is a picture of, as far as I know, the only write-to-read DNA storage drive ever demonstrated, from the Microsoft/University of Washington team that has done much of the research in DNA storage. It cost about $10K and took 21 hours to write then read 5 bytes.
The technical press is equally guilty. The canonical article about some development in the lab starts with the famous IDC graph projecting the amount of data that will be generated in the future. It goes on to describe the amazing density some research team achieved by writing say a megabyte into their favorite medium in the lab, and how this density could store all the world's data in a teacup for ever. This conveys five false impressions.
Market Size
First, that there is some possibility the researchers could scale their process up to a meaningful fraction of IDC's projected demand, or even to the microscopic fraction of the projected demand that makes sense to archive. There is no such possibility. Archival media is a much smaller market than regular media. IBM's Georg Lauhoff and Gary M Decad's slide shows that the market size in dollar terms decreases as you go down the storage hierarchy. LTO tape is less than 1% of the media market in dollar terms and less than 5% in capacity terms.[14]

Timescales
Second, that the researcher's favorite medium could make it into the market in the timescale of IDC's projections. Because the reliability and performance requirements of storage media are so challenging, time scales in the storage market are much longer than the industry's marketeers like to suggest.

Take, for example, Seagate's development of the next generation of hard disk technology, HAMR, where research started twenty-six years ago. Nine years later, in 2008, they published this graph, showing HAMR entering the market in 2009. Seventeen years later it is only now starting to be shipped to the hyper-scalers. Research on data in silica started fifteen years ago. Research on the DNA medium started thirty-six years ago. Neither is within five years of market entry.
Customers
Third, that even if the researcher's favorite medium did make it into the market it would be a product that consumers could use. As Kestutis Patiejunas figured out at Facebook more than a decade ago, because the systems that surround archival media, rather than the media themselves, are the major cost, the only way to make the economics of archival storage work is to do it at data-center scale but in warehouse space, and harvest the synergies that come from not needing data-center power, cooling, staffing, etc.

Storage has an analog of Moore's Law called Kryder's Law, which states that over time the density of bits on a storage medium increases exponentially. Given the need to reduce costs at data-center scale, Kryder's Law limits the service life of even quasi-immortal media. As we see with tape robots, where data is routinely migrated to newer, denser media long before its theoretical lifespan, what matters is the economic, not the technical, lifespan of a medium.
The Cloud
Fourth, that anyone either cares or even knows what medium their archived data lives on. Only the hyper-scalers do. Consumers believe their data is safe in the cloud. Why bother backing it up, let alone archiving it, if it is safe anyway? If anyone really cares about archiving they use a service such as Glacier, in which case they definitely have no idea what medium is being used.

Threats
Fifth, the idea that with quasi-immortal media you don't need Lots Of Copies to Keep Stuff Safe.[15]

Media such as silica, DNA, quartz DVDs, steel tape and so on address bit rot, which is only one of the threats to which long-lived data is subject. Clearly a single copy on such media is still subject to threats including fire, flood, earthquake, ransomware, and insider attacks. Thus even an archive needs to maintain multiple copies. This greatly increases the cost, bringing us back to the economic threat.
The reason why this focus on media is a distraction is that the cost per terabyte of the medium is irrelevant; what drives the economic threat is the capital and operational cost of the system. It is only by operating at data-center scale, and thus amortizing the capital and operational costs over very large amounts of data, that the system costs per terabyte can be made competitive.
The fundamental idea behind LOCKSS was that, given a limited budget and a realistic range of threats, data would survive better in many cheap, unreliable, loosely-coupled replicas than in a single expensive, durable one.
Questions
When giving talks about LOCKSS, Vicky or I often used to feel like the Sergeant in Alice's Restaurant who "spoke for 45 minutes and nobody understood a word he said". We hope that this time we did better. Let's see if we did as we answer your questions.

Footnotes
- In 2006 Vicky predicted that, without collection stewardship, libraries and Starbucks would become indistinguishable. Here is a real Starbucks ad, with one minor addition.
Four years later this prediction came true; Starbucks populated its WiFi networks with a wide range of otherwise pay-walled content such as the Wall Street Journal.
- Library budgets have struggled with journal costs for close on a century, if not longer!
- That is, from a legal framework of the "first sale" doctrine and copyright, to one of contract law and copyright.
- The deployment of IPv6, introduced in December 1995, shows that network protocols are extraordinarily difficult to evolve, because of the need for timely updates to many independent implementations. Format obsolescence implies backwards incompatibility; this is close to impossible in network protocols because it would partition the network. As I discussed in 2012's Formats Through Time, the first two decades of the Web showed that Web formats essentially don't go obsolete.
- This evanescence comes in two forms, link rot, when links no longer resolve, and content drift, when they resolve to different content.
- People's experience of the reliability of their personal data storage is misleading. Reliable, affordable long-term storage at Web scale is an interesting engineering problem.
- The irony of this was that format migration was a technique of which Rothenberg’s article disapproved:
Finally, [format migration] suffers from a fatal flaw. ... Shifts of this kind make it difficult or impossible to translate old documents into new standard forms.
- The first US Web page has been resuscitated by the Stanford Library.
- At least the journals we archived were not malicious; they had actual content that was the same for everybody. That different readers saw different ads was of interest only to students of advertising. But the opportunity to confine readers in a tailored bubble has turned out to be profitable but disastrous.
- Van Jacobson and a team at PARC started a long-term project called Content-Centric Networking (CCN) and gave a 2006 Google Tech Talk about it.
In 2013's Moving vs. Copying I described the difference thus:
- The goal of IP and the layers above is to move data. There is an assumption that, in the normal case, the bits vanish from the sender once they have been transported, and also from any intervening nodes.
- The goal of CCN is to copy data. A successful CCN request creates a locally accessible copy of some remote content. It says nothing about whether in the process other (cached) copies are created, or whether the content is deleted at the source. None of that is any concern of the CCN node making the request; they are configuration details of the underlying network. While it has its copy, the CCN node can satisfy requests from other nodes for that content; it is a peer-to-peer network.
- The Nakamoto coefficient of a subsystem is the minimum number of entities you would need to control in order to control 51% of that subsystem. Because decentralization applies at each layer of a system's stack, it is necessary to measure each of the subsystems individually. In 2017's Quantifying Decentralization, Srinivasan and Lee identified a set of subsystems for public blockchains and measured them using their proposed "Nakamoto coefficient" (a minimal sketch of the calculation follows the table). Their table of the contemporary Nakamoto coefficients for Bitcoin and Ethereum makes the case that they were only minimally decentralized:

| Subsystem | Bitcoin | Ethereum |
|---|---|---|
| Mining | 5 | 3 |
| Client | 1 | 1 |
| Developer | 5 | 2 |
| Exchange | 5 | 5 |
| Node | 3 | 4 |
| Owner | 456 | 72 |
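The calculation behind the coefficient is simple: sort the actors in a subsystem by their share and count how many of the largest are needed to pass 50%. A minimal sketch, with made-up shares:

```python
# Sketch of computing a Nakamoto coefficient: the smallest number of actors
# whose combined share of a subsystem exceeds 50%. The shares below are
# invented for illustration, not measurements.
def nakamoto_coefficient(shares) -> int:
    total = sum(shares)
    running, count = 0.0, 0
    for share in sorted(shares, reverse=True):
        running += share
        count += 1
        if running > total / 2:
            return count
    return count

# Example: four mining pools with 35%, 30%, 20% and 15% of the hash rate.
print(nakamoto_coefficient([35, 30, 20, 15]))  # -> 2
```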
There is an even bigger problem for Ethereum since the blockchain switched to Proof-of-Stake. The software that validators run is close to a mono-culture. Two of the minor players have recently suffered bugs that took them off-line, as Sam Kessler reports in Bug That Took Down 8% of Ethereum's Validators Sparks Worries About Even Bigger Outage:
A bug in Ethereum's Nethermind client software – used by validators of the blockchain to interact with the network – knocked out a chunk of the chain's key operators on Sunday.
Remember "no-one ever gets fired for buying IBM"? At the Ethereum layer, it is "no-one ever gets fired for using Geth" because, if there was ever a big problem with Geth, the blame would be so widely shared.
...
Nethermind powers around 8% of the validators that operate Ethereum, and this weekend's bug was critical enough to pull those validators offline. ... the Nethermind incident followed a similar outage earlier in January that impacted Besu, the client software behind around 5% of Ethereum's validators.
...
Around 85% of Ethereum's validators are currently powered by Geth, and the recent outages to smaller execution clients have renewed concerns that Geth's dominant market position could pose grave consequences if there were ever issues with its programming.
...
Cygaar cited data from the website execution-diversity.info noting that popular crypto exchanges like Coinbase, Binance and Kraken all rely on Geth to run their staking services. "Users who are staked in protocols that run Geth would lose their ETH" in the event of a critical issue, Cygaar wrote.
- Vitalik Buterin, inventor of Ethereum, pointed out in The Meaning of Decentralization:
In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently. ... However, can we really say that the uncoordinated choice model is realistic when 90% of the Bitcoin network’s mining power is well-coordinated enough to show up together at the same conference?
What Buterin is saying is that because decentralized systems in the real world are not composed of "many small actors that make decisions independently", there is nothing to stop the small number of large actors colluding, and thus acting as a centralized system.
- How long should the archived data last? The Long Now Foundation is building the Clock of the Long Now, intended to keep time for 10,000 years (image: Pkirlin, CC BY-SA 3.0):
Ten thousand years is about the age of civilization, so a 10K-year Clock would measure out a future of civilization equal to its past. That assumes we are in the middle of whatever journey we are on – an implicit statement of optimism.
They would like to accompany it with a 10,000-year archive. That is at least two orders of magnitude longer than I am talking about here. We are only just over three-quarters of a century from the first stored-program computer, so designing a digital archive for a century is a very ambitious goal. Note that the design of the Clock of the Long Now is as much social as technical. It is designed to motivate infrequent but continual pilgrimages:
On days when visitors are there to wind it, the calculated melody is transmitted to the chimes, and if you are there at noon, the bells start ringing their unique one-time-only tune. The 10 chimes are optimized for the acoustics of the shaft space, and they are big.
Finally, way out of breath, you arrive at the primary chamber. Here is the face of the Clock. A disk about 8 feet in diameter artfully displays the natural cycles of astronomical time, the pace of the stars and the planets, and the galactic time of the Earth’s procession. If you peer deep into the Clock’s workings you can also see the time of day.
But in order to get the correct time, you need to "ask" the clock. When you first come upon the dials the time it displays is an older time given to the last person to visit. If no one has visited in a while, say, since 8 months and 3 days ago, it will show the time it was then. To save energy, the Clock will not move its dials unless they are turned, that is, powered, by a visitor. The Clock calculates the correct time, but will only display the correct time if you wind up its display wheel.
- It is noteworthy that in 2023 Optical Archival (OD-3), the most recent archive-only medium, was canceled for lack of a large enough market. It was a 1TB optical disk, an upgrade from Blu-Ray.
- No medium is perfect. They all have a specified Unrecoverable Bit Error Rate (UBER). For example, typical disk UBERs are 10^-15. A petabyte is 8×10^15 bits, so if the drive is within its specified performance you can expect up to 8 errors when reading a petabyte. The specified UBER is an upper limit; you will normally see far fewer. The UBER for LTO-9 tape is 10^-20, so unrecoverable errors on a new tape are very unlikely. But not impossible, and the rate goes up steeply with tape wear.
The property that classifies a medium as quasi-immortal is not that its reliability is greater than regular media to start with, although as with tape it may be. It is rather that its reliability decays more slowly than that of regular media. Thus archival systems need to use erasure coding to mitigate both UBER data loss and media failures such as disk crashes and tape wear-out.
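As a quick check of the footnote's arithmetic (the expected number of unrecoverable errors is roughly the bits read times the specified UBER, as an upper bound):

```python
# The footnote's arithmetic: expected unrecoverable read errors is roughly
# bits read times the specified UBER (an upper bound, not a typical value).
petabyte_bits = 8e15            # one petabyte read
disk_uber = 1e-15               # typical disk specification
tape_uber = 1e-20               # LTO-9 specification

print(petabyte_bits * disk_uber)   # up to ~8 errors per petabyte read from disk
print(petabyte_bits * tape_uber)   # ~8e-5 errors per petabyte read from new tape
```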