The DPN started in 2012 and:
1) it was anticipated that there would be different nodes specializing in different types of content (e.g., text, data and moving images) and providing replication, audit, succession etc. at the bit level across the nodes; and 2) relatedly, the goal was to start at the most basic level (i.e., bit-level preservation with audit and succession) and then start working up the stack of services that are involved in full-blown digital preservation.

What was the landscape of digital preservation back in 2012 that motivated the DPN? A year earlier, I had written A Brief History of E-Journal Preservation. Referring to it, we see that by 2012:
- The LOCKSS Program was 14 years old, had been in production use 8 years, and had been economically self-sustaining for 5 years. There had been three designs and two complete implementations of the protocol by which LOCKSS boxes communicated, the last of which was based on award-winning computer science research.
- Portico had been in production for 8 years but initially it:
failed to achieve economic sustainability on its own. As Bill Bowen said discussing the Blue Ribbon Task Force Report:
"it has been more challenging for Portico to build a sustainable model than parts of the report suggest."Libraries proved unwilling to pay enough to cover its costs. It was folded into a single organization with JSTOR, in whose $50M+ annual cash flow Portico's losses could be buried.
E-journal preservation economics were based on protecting institutions’ investment in expensive subscription content. Elsewhere, things were less sustainable. Institutional repositories contained little, and what they did contain was not very important. The reason was that getting stuff into them was too hard and costly.
As I wrote in my initial DPN post:
Each of the libraries represented had made significant investments in establishing an institutional repository, which was under-utilized due to the difficulty of persuading researchers to deposit materials. With the video collection out of the picture as too expensive, the librarians seized on diversity as the defense against the monoculture threat to preservation.

In my view there were two main reasons:
- Replicating pre-ingested content from other institutions was a quicker and easier way to increase the utilization of their repository than educating faculty.
- Jointly marketing a preservation service that, through diversity, would be more credible than those they could offer individually was a way of transferring money from other libraries' budgets to their repositories' budgets.

Of course, the diversity goal also meant that the DPN was an add-on to their existing institutional repositories. A hypothetical converged system would have been a threat to them. Alas, this meant that the founders' incentives were not aligned with their customers'.
The DPN’s pitch to customers was, in effect, that it would be a better institutional repository than one they ran themselves. Making the economics of "institutional repository as a service" sustainable required greatly improving the ingest process at each node for the content type in which it specialized. That was what would determine the operational expenses, and thus the prices the DPN needed to charge. Doing so posed major:
- design problems, because metadata for the content was not standardized between the submitting institutions (unlike the fairly standard e-journal metadata),
- implementation problems, because there were no off-the-shelf solutions, and
- cost problems, because this required site- and content-type-specific development, not development shared between the nodes.
In December of 2014, Dave Pcolar was hired as the Chief Technical Officer and, with his leadership and direction, a consensus was reached on the best approach to develop the network.

The consensus was that the nodes would export a custom REST API. Because diversity was the whole point of the DPN, each node had to implement both the server and client sides of the API to integrate with their existing repository infrastructure. Pretty much the only shared implementation effort was the API specification. Which, of course, is what the diversity goal was intended to achieve.
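To make the architecture concrete, here is a minimal sketch of the idea that each node must implement both sides of a shared replication contract. Everything here is invented for illustration (the class, method names, node names, and fixity scheme are assumptions, not the actual DPN API): the point is that the only shared artifact is the interface, while every node supplies its own implementation of both the "server" role (accepting and serving bags) and the "client" role (pulling and auditing copies from peers).

```python
import hashlib

class Node:
    """Hypothetical DPN-style node, for illustration only.

    Each node plays both roles mandated by the shared API contract:
    server (accepting deposits, serving bags, answering fixity queries)
    and client (pulling copies from peers and auditing them).
    """
    def __init__(self, name):
        self.name = name
        self.bags = {}  # bag_id -> payload bytes

    # "Server side": accept a deposit, return its fixity value.
    def put_bag(self, bag_id, payload):
        self.bags[bag_id] = payload
        return hashlib.sha256(payload).hexdigest()

    def get_bag(self, bag_id):
        return self.bags[bag_id]

    def fixity(self, bag_id):
        return hashlib.sha256(self.bags[bag_id]).hexdigest()

    # "Client side": pull a copy from a peer and audit it on arrival.
    def replicate_from(self, peer, bag_id):
        payload = peer.get_bag(bag_id)
        digest = self.put_bag(bag_id, payload)
        assert digest == peer.fixity(bag_id), "fixity mismatch"
        return digest

# Two nodes with (in the real network) entirely different back ends
# interoperate only through the shared contract.
a, b = Node("node-a"), Node("node-b")
a.put_bag("bag-001", b"payload bytes")
b.replicate_from(a, "bag-001")
```

The sketch also shows why the diversity goal multiplied cost: the `Node` body here is trivial, but in the DPN each institution had to re-implement it against its own repository stack, so almost none of the development effort could be shared.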
The problem was that the participating institutional repositories were uneconomic and mostly empty. It could not be solved without making the ingest process much cheaper and easier. After all, someone was going to have to do the work and pay the cost of ingest. Not realizing this was a major management failure. As the final report shows, the customers told them that this was the requirement:
institutions repeatedly stated that they did not have a good workflow for digital preservation. Many institutions said that they did not have sufficient in-depth knowledge of their digital collections to manage them for long-term preservation. Local systems for managing content did not have a built-in “export to DPN” function and this presented a problem of how to prepare and move the content for deposit into DPN.

But that wasn't the real management failure. It was true that diversity improved the network's robustness against hypothetical future attacks and failures. The fundamental management failure was not to appreciate that, in return for this marginal future benefit, diversity immediately guaranteed that the product they had to offer would be more expensive and take longer to build, be more expensive to operate and maintain, and be more complex and thus less reliable than a centralized commercial competitor. Several of which duly arrived in the market before DPN did.
Indeed, a year before DPN started the commercial pioneer of outsourced institutional repositories, bepress, was already focused on this area. There was clearly a market for outsourcing institutional repositories. By 2017, bepress had:
more than 500 participating institutions, predominantly US colleges and universities. bepress claims a US market share of approximately 50% overall, recognizing that not all institutions have an institutional repository. Among those universities that conduct the greatest amount of research, for example the 115 US universities with highest research activity, bepress lists 34 as Digital Commons participants, for a market share of about 30%.

DPN management should have been aware of the potential competition. They could have reviewed the problem at the start, saying to the sponsoring institutions:
The diversity thing isn't going to be viable. What the world needs is a major improvement in the cost and ease-of-use of institutional repository ingest. Why don't we spend the money on that instead?

Unfortunately, this wouldn't have worked for two main reasons:
- The management had no concrete plan for solving the cost and ease-of-use problem, which was widely known to be very difficult, so success was unlikely.
- If success were achieved, it would benefit all institutional repositories, including the potential commercial competitors. Benefiting the repositories of the institutions behind the DPN was its real goal.
I didn't understand that the various institutional repositories saw LOCKSS not as a useful technology but as a competing system. To them it was important that, whatever solution emerged from the meetings, LOCKSS not be part of it. Once I figured that out, there was no point in participating in further meetings.
All told, DPN spent just over $7M. Table 1 shows where the money went. Note that Overhead and Marketing consumed almost 60% of the total spend.
Table 2 shows where the R&D spending went, illustrating the distributed and site-specific nature of the development mandated by the diversity goal.
The key design principle of the LOCKSS Program from its birth in 1998 was to spend on hardware to minimize operational and overhead costs. As we wrote in 2003:
Minimizing the cost of participating in the LOCKSS system is essential to its success, so individual peers are built from low-cost, unreliable technology. A generic PC with three 180GB disks currently costs under $1000 and would preserve about 210 years of the largest journal we have found (the Journal of Biological Chemistry) for a worst-case hardware cost of less than $5 per journal/year.

The LOCKSS technology was often criticized for wasting disk since a system with many fewer copies could achieve the same theoretical reliability. The critics didn't understand the point made by the graph. The important thing to minimize is the thing that costs the most, which at scales less than Petabytes is never the hardware:
Peers require little administration, relying on cooperation with other caches to detect and repair failures. There is no need for off-line backups on removable media. Creating these backups, and using them when readers request access to data, would involve excessive staff costs and latencies beyond a reader’s attention span.
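The 2003 cost claim quoted above is easy to sanity-check with trivial arithmetic; the $1000 and 210 journal-year figures come straight from the quote:

```python
# Sanity check of the 2003 LOCKSS worst-case hardware cost claim:
# a ~$1000 PC holding about 210 journal-years of the largest journal
# found should come in under $5 per journal/year.
pc_cost_usd = 1000    # generic PC with three 180GB disks (2003 prices)
journal_years = 210   # capacity in years of the Journal of Biological Chemistry
cost = pc_cost_usd / journal_years
print(f"worst-case hardware cost: ${cost:.2f} per journal/year")  # $4.76
assert cost < 5  # matches the "less than $5" claim
```

Note that this is the hardware cost only, which is precisely the point of the surrounding argument: at sub-Petabyte scale the dominant costs are staff and administration, not disks.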
The peer-to-peer architecture of the LOCKSS system is unusual among digital preservation systems for a specific reason. The goal of the system was to preserve published information, which one has to assume is covered by copyright. One hour of a good copyright lawyer will buy, at  prices, about 12TB of disk, so the design is oriented to making efficient use of lawyers, not making efficient use of disk.

An interesting and constructive suggestion for future efforts is at the end of this Quartz piece.