Thursday, November 1, 2018

Ithaka's Perspective on Digital Preservation

Oya Rieger of Ithaka S+R has published a report entitled The State of Digital Preservation in 2018: A Snapshot of Challenges and Gaps. In June and July Rieger:
talked with 21 experts and thought leaders to hear their perspectives on the state of digital preservation. The purpose of this report is to share a number of common themes that permeated through the conversations and provide an opportunity for broader community reaction and engagement, which will over time contribute to the development of an Ithaka S+R research agenda in these areas.
Below the fold, a critique.

The first thing to notice is that the list of interviewees includes only managers. It lacks anyone actively developing tools and services for digital preservation, and anyone whose hands are actually on the tasks of digital preservation. Rieger writes:
This was not based on a balanced sampling of individuals involved in different stages of digital preservation, holding distinctive roles (e.g., senior library leaders, director of preservation services, preservation specialist/technician, IT specialist, archivist, etc.).
This is the bureaucracy's view of the landscape. Not that their perspective isn't valuable, but it is just one view that can be somewhat out of touch with "ground truth" (as illustrated by the misspelling of Memento, Archive-It, and MetaArchive, and the mis-attribution of the BitCurator Consortium). As a result, the report lacks any real reference to the technical, as opposed to the organizational and educational, challenges facing the field.

Second, there is very little coverage of Web archiving, which is by far the largest and most important digital preservation initiative both for current and future readers. The Internet Archive rates only two mentions, in the middle of a list of activities and in a footnote. This is despite the fact that it is currently the 211th most visited site in the US (272nd globally) with over 5.5M registered users, adding over 500 per day, and serving nearly 4M unique IPs per day. For comparison, the Library of Congress currently ranks 1439th in the US (5441st globally). The Internet Archive's Web collection alone probably dwarfs all other digital preservation efforts combined both in size and in usage. Not to mention their vast collections of software, digitized books, audio, video and TV news.

Rieger writes:
There is a lack of understanding about how archived websites are discovered, used, and referenced. “Researchers prefer to cite the original live-web as it is easier and shorter,” pointed out one of the experts. “There is limited awareness of the existence of web archives and lack of community consensus on how to treat them in scholarly work. The problems are not about technology any more, it is about usability, awareness, and scholarly practices.” The interviewee referred to a recent CRL study based on an analysis of referrals to archived content from papers that concluded that the citations were mainly to articles about web archiving projects.
It is surprising that the report doesn't point out that the responsibility for educating scholars in the use of resources lies with the "experts and thought leaders" from institutions such as the University of California, Michigan State, Cornell, MIT, NYU and Virginia Tech. That these "experts and thought leaders" don't consider the Internet Archive to be a resource worth mentioning might have something to do with the fact that their scholars don't know that they should be using it.

A report whose first major section, entitled "What's Working Well", totally fails to acknowledge the single most important digital preservation effort of the last two decades clearly lacks credibility.

Third, the report ignores the most important, and long overdue, new digital preservation effort of the last two years. In that short period the Software Heritage Foundation has made huge progress in collecting and preserving software source code in a useful form:
the Software Heritage Archive contains more than four billion unique source code files and one billion individual commits, gathered from more than 80 million publicly available source code repositories (including a full and up-to-date mirror of GitHub) and packages (including a full and up-to-date mirror of Debian). Three copies are currently maintained, including one on a public cloud.

As a graph, the Merkle DAG underpinning the archive consists of 10 billion nodes and 100 billion edges; in terms of resources, the compressed and fully de-duplicated archive requires some 200TB of storage space
As I wrote in 2013:
Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.
The blind spot that has prevented libraries and archives from collecting, preserving, and making available to scholars the extraordinary collaborative cultural achievement represented by the open source code base is really astonishing.
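The de-duplication mentioned in the Software Heritage quote follows from content addressing: each file and directory is named by a hash of its contents, so identical content shared across repositories is stored once. Here is a minimal sketch of the idea. The SHA-256 hash, the object layout, and the names `blob_id`, `tree_id` and `Store` are all assumptions for illustration, not Software Heritage's actual identifier scheme.

```python
import hashlib


def blob_id(data: bytes) -> str:
    """Name a file by the hash of its contents (hypothetical scheme)."""
    return hashlib.sha256(b"blob\0" + data).hexdigest()


def tree_id(entries: dict[str, str]) -> str:
    """Name a directory by hashing its sorted (name, child id) pairs."""
    h = hashlib.sha256(b"tree\0")
    for name in sorted(entries):
        h.update(name.encode() + b"\0" + entries[name].encode() + b"\n")
    return h.hexdigest()


class Store:
    """Toy content-addressed store: one entry per distinct object."""

    def __init__(self) -> None:
        self.objects = {}

    def put_blob(self, data: bytes) -> str:
        oid = blob_id(data)
        self.objects.setdefault(oid, data)  # no-op if already stored
        return oid

    def put_tree(self, entries: dict[str, str]) -> str:
        oid = tree_id(entries)
        self.objects.setdefault(oid, dict(entries))
        return oid


# Two made-up repositories that share an identical LICENSE file.
store = Store()
license_text = b"Permission is hereby granted\n"
repo_a = store.put_tree({
    "LICENSE": store.put_blob(license_text),
    "main.c": store.put_blob(b"int main(void) { return 0; }\n"),
})
repo_b = store.put_tree({
    "LICENSE": store.put_blob(license_text),
    "main.py": store.put_blob(b"print('hi')\n"),
})
```

The two directory nodes point at the same LICENSE node, so the store holds three file objects rather than four. Repeated across tens of millions of repositories that copy and fork each other's code, this sharing is what makes a fully de-duplicated archive far smaller than the sum of its sources.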

Fourth, the report fails to notice the important developments from the teams at the University of Freiburg and at the Internet Archive in making emulation of preserved software binaries both routine and scalable. The Internet Archive's software collection now includes over 100K titles.

Finally, there is no acknowledgement that the most serious challenge facing the field is economic. Except for a few corner cases, we know how to do digital preservation; we just don't want to pay enough to have it done. Thus the key challenge is to achieve some mixture of significant increase in funding for, and significant cost reduction in the processes of, digital preservation.

Information technology processes naturally have very strong economies of scale, which result in winner-take-all markets (as W. Brian Arthur pointed out in 1985). It is notable that the report doesn't mention the winners we already have, in Web and source code archiving, and in emulation. All are at the point where a competitor is unlikely to be viable.

To be affordable, digital preservation needs to be done at scale. The report's orientation is very much "let a thousand flowers bloom", which in IT markets only happens at a very early stage. This is likely the result of talking only to people nurturing a small-scale flower, not to people who have already dominated their market niche. It is certainly a risk that each area will have a single point of failure, but trying to fight against the inherent economics of IT pretty much guarantees ineffectiveness.

No doubt Rieger will defend the report by saying, correctly, that it mostly represents a summary of what she was told by the "experts and thought leaders" of the digital preservation field. If so, it reveals a disturbing insularity among those "experts and thought leaders".

The report's final section is Rieger's. It identifies three "Potential Research Areas":
  1. Building a "cohesive and compelling roadmap".
  2. Dealing with the "notions of ownership and control".
  3. Enunciating a "strong set of value propositions" to justify increased funding.
My reaction to these is "meh".

1) The big successes in the field haven't come from consensus building around a roadmap. They have come from idiosyncratic individuals such as Brewster Kahle, Roberto Di Cosmo and Jason Scott identifying a need and building a system to address it no matter what "the community" thinks. We have a couple of decades of experience showing that "the community" is incapable of coming to a coherent consensus that leads to action on a scale appropriate to the problem. In any case, describing road-mapping as "research" is a stretch.

2) Under severe funding pressure, almost all libraries have de-emphasized their custodial role of building collections in favor of responding to immediate client needs. Rieger writes:
As one interviewee stated, library leaders have “shifted their attention from seeing preservation as a moral imperative to catering to the university’s immediate needs.”
Regrettably, but inevitably given the economics of IT markets, this provides a market opportunity for outsourcing. Ithaka has exploited one such opportunity with Portico. This bullet does describe "research" in the sense of "market research". Success is, however, much more likely to come from the success of an individual effort than from a consensus about what should be done among people who can't actually do it.

3) In the current climate, increased funding for libraries and archives simply isn't going to happen. These institutions have shown a marked reluctance to divert their shrinking funds from legacy to digital media. Thus the research topic with the greatest leverage in turning funds into preserved digital content is increasing the cost-effectiveness of the tools, processes and infrastructure of digital preservation.


Dorothea said...

If you haven't already read Alissa Centivany's "The Dark History of HathiTrust," I think you would enjoy it; it is additional evidence for your argument about individuals rather than collectives taking responsibility for digital preservation.

David. said...

@LibSkrat tweets:

"This is a repeated problem with Ithaka research. They are positively ALLERGIC to talking to anybody who isn’t a manager.

It really damages the credibility of their research."

There's a misunderstanding here.

Ithaka S+R are consultants. The goal of a consultant is more contracts. Their market is managers; only managers can hire them, or pay for expensive reports. So what they need to know is what managers want to hear. It's pointless for them to talk to the managers' staff, or to write what the staff want to hear. What Ithaka S+R calls "research" is really "marketing" for future "research"; setting "the community's" agenda in ways that require lots more "research" by Ithaka S+R.

Institutions like the Internet Archive or Software Heritage don't need Ithaka S+R to tell them what to do; they have a clear mission and they're sticking to it. So there's no money for Ithaka S+R in paying attention to them.

I'm not blaming Ithaka S+R for behaving this way. Their "Slow AI" makes them do what they need to do to stay in business (see also Cory Doctorow). The real problem is the insularity of the "experts and thought leaders".

David. said...

In response to this post, someone tweeted a link to this short video.

David. said...

See also the tooltip on this XKCD. As regards, I actually did.