Tuesday, April 16, 2019

The Demise Of The Digital Preservation Network

Now that I've had a chance to read the Digital Preservation Network (DPN): Final Report, I feel the need to add to my initial reactions in Digital Preservation Network Is No More, which were based on Roger Schonfeld's Why Is the Digital Preservation Network Disbanding?. Below the fold, my second thoughts.

The DPN started in 2012 and:
it was anticipated that there would be different nodes specializing in different types of content (e.g., text, data and moving images) and providing replication, audit, succession etc. at the bit level across the nodes; and 2) relatedly, the goal was to start at the most basic level (i.e., bit-level preservation with audit and succession) and then start working up the stack of services that are involved in full-blown digital preservation
What was the landscape of digital preservation back in 2012 that motivated the DPN? A year earlier, I had written A Brief History of E-Journal Preservation. Referring to it, we see that by 2012:
  • The LOCKSS Program was 14 years old, had been in production use 8 years, and had been economically self-sustaining for 5 years. There had been three designs and two complete implementations of the protocol by which LOCKSS boxes communicated, the last of which was based on award-winning computer science research.
  • Portico had been in production for 8 years but initially it:
    failed to achieve economic sustainability on its own. As Bill Bowen said discussing the Blue Ribbon Task Force Report:
    "it has been more challenging for Portico to build a sustainable model than parts of the report suggest."
    Libraries proved unwilling to pay enough to cover its costs. It was folded into a single organization with JSTOR, in whose $50M+ annual cash flow Portico's losses could be buried.
Thus e-journal preservation systems had a lot of experience showing that the real problem was economic, not technical, and that ingest was the largest cost. The LOCKSS team’s rule of thumb was that it was half the lifetime cost, with preservation being a third and access a sixth. And ingesting e-journals was cheap and easy compared to the less well-organized content the DPN hoped to target.

E-journal preservation economics were based on protecting institutions’ investment in expensive subscription content. Elsewhere, things were less sustainable. Institutional repositories contained little, and what they did contain was not very important. The reason was that getting stuff into them was too hard and costly.

As I wrote in my initial DPN post:
Each of the libraries represented had made significant investments in establishing an institutional repository, which was under-utilized due to the difficulty of persuading researchers to deposit materials. With the video collection out of the picture as too expensive, the librarians seized on diversity as the defense against the monoculture threat to preservation. In my view there were two main reasons:
  • Replicating pre-ingested content from other institutions was a quicker and easier way to increase the utilization of their repository than educating faculty.
  • Jointly marketing a preservation service that, through diversity, would be more credible than those they could offer individually was a way of transferring money from other libraries' budgets to their repositories' budgets.
Alas, this meant that the founders' incentives were not aligned with their customers'.
Of course, the diversity goal also meant that the DPN was an add-on to their existing institutional repositories. A hypothetical converged system would have been a threat to them.

The DPN’s pitch to customers was, in effect, that it would be a better institutional repository than one they ran themselves. Making the economics of "institutional repository as a service" sustainable required greatly improving the ingest process at each node for the content type in which it specialized. That was what would determine the operational expenses, and thus the prices the DPN needed to charge. Doing so posed major:
  • design problems, because metadata for the content was not standardized between the submitting institutions (unlike the fairly standard e-journal metadata),
  • implementation problems, because there were no off-the-shelf solutions, and
  • cost problems, because this required site- and content-type-specific development, not development shared between the nodes.
The technical goal DPN’s management set themselves wasn’t to solve this critical customer-facing business model problem. It was to solve the internal problem of replicating and auditing content that wasn't going to be in the nodes in the first place, because it was too hard to get it in. Despite the fact that replicating and auditing was a problem that could have been solved by assembling off-the-shelf, production-proven components, it took them 2 years to hire a technical lead capable of reaching consensus on how to solve it:
In December of 2014, Dave Pcolar was hired as the Chief Technical Officer and with his leadership and direction, a consensus was reached on the best approach to develop the network.
The consensus was that the nodes would export a custom REST API. Because diversity was the whole point of the DPN, each node had to implement both the server and client sides of the API to integrate with their existing repository infrastructure. Pretty much the only shared implementation effort was the API specification. Which, of course, is what the diversity goal was intended to achieve.

The problem was that the participating institutional repositories were uneconomic and mostly empty. It could not be solved without making the ingest process much cheaper and easier. After all, someone was going to have to do the work and pay the cost of ingest. Not realizing this was a major management failure. As the final report shows, the customers told them that this was the requirement:
institutions repeatedly stated that they did not have a good workflow for digital preservation. Many institutions said that they did not have sufficient in-depth knowledge of their digital collections to manage them for long-term preservation. Local systems for managing content did not have a built-in “export to DPN” function and this presented a problem of how to prepare and move the content for deposit into DPN.
But that wasn't the real management failure. It was true that diversity improved the network's robustness against hypothetical future attacks and failures. The fundamental management failure was not to appreciate that, in return for this marginal future benefit, diversity immediately guaranteed that the product they had to offer would be more expensive and take longer to build, be more expensive to operate and maintain, and be more complex and thus less reliable than a centralized commercial competitor. Several of which duly arrived in the market before DPN did.

Indeed, a year before DPN started, the commercial pioneer of outsourced institutional repositories, bepress, was already focused on this area. There was clearly a market for outsourcing institutional repositories. By 2017, bepress had:
more than 500 participating institutions, predominantly US colleges and universities. bepress claims a US market share of approximately 50% overall, recognizing that not all institutions have an institutional repository. Among those universities that conduct the greatest amount of research, for example the 115 US universities with highest research activity, bepress lists 34 as Digital Commons participants, for a market share of about 30%.
DPN management should have been aware of the potential competition. They could have reviewed the problem at the start, saying to the sponsoring institutions:
The diversity thing isn't going to be viable. What the world needs is a major improvement in the cost and ease-of-use of institutional repository ingest. Why don't we spend the money on that instead?
Unfortunately, this wouldn't have worked for two main reasons:
  • The management had no concrete plan for solving the cost and ease-of-use problem, which was widely known to be very difficult, so success was unlikely.
  • If success were achieved, it would benefit all institutional repositories, including the potential commercial competitors. Benefiting the repositories of the institutions behind the DPN was its real goal.
Going into the early discussions I didn't understand what the real goal was. At this point I need to confess that the focus on mitigating monoculture risk may have been my fault. If I recall correctly, I was the one who raised the issue. I hoped that the need for inter-operation among the institutions would motivate a second, independent implementation of the LOCKSS protocol. That would have both provided a well-proven basis for interoperability among the DPN nodes, and allowed LOCKSS to mitigate monoculture risk by using it for some of the LOCKSS boxes.

I didn't understand that the various institutional repositories saw LOCKSS not as a useful technology but as a competing system. To them, it was important that LOCKSS not be part of whatever solution emerged from the meetings. Once I figured that out, there was no point in participating in further meetings.

Table 1
Heading       Cost         Percent
Development   $2,782,693   39.7%
Operations    $134,454     1.9%
Marketing     $795,136     11.4%
Overhead      $3,289,038   47.0%
Total         $7,001,321

All told, DPN spent just over $7M. Table 1 shows where the money went. Note that Overhead and Marketing consumed almost 60% of the total spend.
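
The percentages can be re-derived from the dollar figures in Table 1. The short Python sketch below is only a cross-check of that arithmetic. The variable names are mine, not the report's, and the dollar amounts are copied from Table 1.

```python
# Dollar figures copied from Table 1; the names are illustrative only.
costs = {
    "Development": 2_782_693,
    "Operations": 134_454,
    "Marketing": 795_136,
    "Overhead": 3_289_038,
}

total = sum(costs.values())

# Share of total spend for each heading, as a percentage to one decimal place.
shares = {heading: round(100 * amount / total, 1) for heading, amount in costs.items()}

# Combined share of the two headings that did not build or run anything.
overhead_and_marketing = round(
    100 * (costs["Overhead"] + costs["Marketing"]) / total, 1
)

print(total)
for heading, pct in shares.items():
    print(f"{heading}: {pct}%")
print(f"Overhead + Marketing: {overhead_and_marketing}%")
```

The four headings sum to the Total row, and Overhead plus Marketing together come to a little over 58% of the spend.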

Table 2
Institution   R&D
AP Trust      $375,097
Chronopolis   $414,882
DuraSpace     $581,610
Hathi Trust   $314,018
Stanford      $657,534
Texas         $439,552

Table 2 shows where the R&D spending went, illustrating the distributed and site-specific nature of the development mandated by the diversity goal.

In my view, the key lesson to be learnt from the DPN Final Report is in this graph, from page 15. It shows that the vast majority of the per-TB cost of the system in operation was in overhead, not in actually preserving content. To be viable, the system would have had to preserve enormous amounts of data, while holding overhead costs constant. Of course, preserving vast amounts of data without increasing overhead would have needed much more efficient ingest mechanisms.
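
To make the amortization argument concrete, here is a toy model in Python. It assumes the per-TB cost is a fixed overhead spread across the stored data plus a constant marginal cost per TB. That simple form, the function name, and both dollar values are hypothetical illustrations of mine, not numbers from the DPN report.

```python
def cost_per_tb(fixed_overhead_usd, marginal_usd_per_tb, stored_tb):
    """Per-TB cost when a fixed overhead is spread over stored_tb terabytes."""
    if stored_tb <= 0:
        raise ValueError("stored_tb must be positive")
    return fixed_overhead_usd / stored_tb + marginal_usd_per_tb


# HYPOTHETICAL values chosen only to show the shape of the curve.
FIXED_OVERHEAD_USD = 1_000_000.0   # assumed annual overhead, not from the report
MARGINAL_USD_PER_TB = 100.0        # assumed storage cost per TB, not from the report

for stored_tb in (100, 1_000, 10_000):
    per_tb = cost_per_tb(FIXED_OVERHEAD_USD, MARGINAL_USD_PER_TB, stored_tb)
    print(f"{stored_tb:>6} TB stored: ${per_tb:,.0f} per TB")
```

Under these made-up numbers the overhead term dominates until the stored volume grows by orders of magnitude, which is the shape of the problem the graph describes.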

The key design principle of the LOCKSS Program from its birth in 1998 was to spend on hardware to minimize operational and overhead costs. As we wrote in 2003:
Minimizing the cost of participating in the LOCKSS system is essential to its success, so individual peers are built from low-cost, unreliable technology. A generic PC with three 180GB disks currently costs under $1000 and would preserve about 210 years of the largest journal we have found (the Journal of Biological Chemistry) for a worst-case hardware cost of less than $5 per journal/year.
Peers require little administration, relying on cooperation with other caches to detect and repair failures. There is no need for off-line backups on removable media. Creating these backups, and using them when readers request access to data, would involve excessive staff costs and latencies beyond a reader’s attention span.
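
The hardware figure in the 2003 quote is simple division, sketched below in Python. It uses only the two numbers stated in the quote, taking "under $1000" as exactly $1000 for the worst case. The variable names are mine.

```python
# Figures from the 2003 quote; "under $1000" is taken as the worst-case $1000.
pc_cost_usd = 1000
journal_years_preserved = 210  # years of the largest journal one PC could hold

hardware_cost_per_journal_year = pc_cost_usd / journal_years_preserved

print(f"${hardware_cost_per_journal_year:.2f} per journal/year")
```

This comes out a little under $5, consistent with the "less than $5 per journal/year" claim.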
The LOCKSS technology was often criticized for wasting disk since a system with many fewer copies could achieve the same theoretical reliability. The critics didn't understand the point made by the graph. The important thing to minimize is the thing that costs the most, which at scales less than Petabytes is never the hardware:
The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at [2014] prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk.
An interesting and constructive suggestion for future efforts is at the end of this Quartz piece.


Bryson said...

Very good overview of some of the reasons why the DPN was not successful. I think a very pertinent sentence from your post is "Benefiting the repositories of the institutions behind the DPN was its real goal." I don't have any reason to think that was intentional but that does seem like the way it was managed from afar. The failure of DPN was not a technical problem but a management problem in that they, as this post describes, did not address (or apparently even seriously attempt to address) the needs of the customers in a way that was better or cheaper than other solutions.

David. said...

And another one bites the dust. Via e-mail:

"The Keepers Registry, funded by Jisc, will end on 31 July 2019. From 1 August 2019, you will need to use an alternative service to monitor the archival status of e-journal content."

This is just more evidence that libraries look on long-term custodianship of the academic record as someone else's problem:

"in 2017/18 there were 249 sessions from 63 UK HEIs with around 50% of those HEIs only recording only one session per year. This usage by UK HEIs only represented 20% of all interactions with the service with 80% of activity being traced to users from outside of the UK. The continued allocation of funding to subsidise the delivery of a free international service is no longer well aligned with Jisc’s primary role as a provider of value-for-money digital services to its UK HE member organisations."

David. said...

And another one bites the dust. CDL had been running a pilot called the UC Data Network, but they just posted UC Data Network: Lessons Learned. The lesson learned is to outsource:

"While we have decided not to continue with the UCDN pilot, we now are in the position to leverage our lessons learned to move forward and achieve the original goals for the UCDN effort by focusing time and resources on our new Dryad partnership."