After two decades working on digital preservation in general, and archiving Web content in particular:
The most important lesson I've learned is that this is fundamentally an economic problem; we know how to do it but we don't want to pay enough to have it done.
Jefferson's talk provided a lot of data, shown in the tables, to back up that assertion.
|Internet Archive Holdings|
|Hours of TV||3,000,000|
The Archive's recycled church in San Francisco and its second site nearby sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. The Web collection alone serves over 800K users per day.
|Archive-It Partner Types|
|Archive-It Partners||Percent of total|
|Non-profit or NGO||5.67|
|Museum or Art Library||3.78|
|National Govt. Agency||3.55|
The significant participation of public libraries is relatively recent, the result of an IMLS- and IA-funded "Community Webs Project". It provides 27 public libraries with:
education, applied training, cohort network development, and web archiving services
The goal is to help public libraries build local history collections for the future.
|Budgets of 115 ARL Libraries|
|Annual Budget of||$M|
|All ARL Libraries||3,500|
|Avg. ARL Library||30|
|Avg. ARL Web archive||0.01225|
|Avg. EDU Web archive||0.00675|
|Web Archiving vs. Budget|
|Avg. ARL Acquisitions||9*10^-6|
|Internet Archive Budget|
|Web Hardware per PB||0.24|
|IA Staff Salaries||11.0|
|IA Web Staff Salaries||2.2|
|Avg. ARL Total||30.0|
|Avg. ARL Salaries||5.8|
Adding hardware and staff, the Internet Archive spends $3,400,000 a year on Web archiving, or 21% of its total budget. This is about 2.4 times as much as all the ARL libraries combined spend on Web archiving.
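The comparison can be checked with a few lines of arithmetic. This is only a sketch of one reading of the tables, not necessarily how the figure was derived: it assumes the Internet Archive budget rows are in $M like the ARL rows, that the ingest figures are in TB, and that the hardware cost is the per-PB figure multiplied by the annual ingest. The variable names are mine.

```python
# Figures taken from the tables; units ($M, TB) are assumed, not stated there.
ARL_LIBRARIES = 115
AVG_ARL_WEB_ARCHIVE_M = 0.01225   # average ARL Web archive budget, $M/year
IA_WEB_STAFF_M = 2.2              # IA Web staff salaries, $M/year
IA_HARDWARE_PER_PB_M = 0.24       # IA Web hardware cost per PB, $M
IA_INGEST_PB = 5.120              # 5,120 (assumed TB) ingested per year

# Assumed derivation of the IA figure: staff plus hardware for a year's ingest.
ia_web_total_m = IA_WEB_STAFF_M + IA_HARDWARE_PER_PB_M * IA_INGEST_PB

# All ARL libraries combined.
arl_web_total_m = ARL_LIBRARIES * AVG_ARL_WEB_ARCHIVE_M

ratio = ia_web_total_m / arl_web_total_m

print(f"IA Web archiving:  ${ia_web_total_m:.2f}M")
print(f"ARL Web archiving: ${arl_web_total_m:.2f}M")
print(f"IA / ARL:          {ratio:.1f}x")
```

Under these assumptions the IA total comes out slightly above $3.4M and the ARL total at about $1.4M, which is consistent with the ratio of about 2.4 quoted above.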
According to the 2017 NDSA Web Archiving Survey, by far the most common staffing level for Web archiving among libraries is 0.25 FTE. Staff whose majority time commitment is elsewhere will not be very productive in their Web archiving.
|Web Content Ingested Annually|
|Total ARL Archive-It (90)||95|
|Total .edu Archive-It (275)||160|
|Total non-.edu Archive-It (190)||120|
|Total LoC + NARA||325|
|Total Internet Archive||5,120|
What kind of content are libraries collecting from the Web? Overwhelmingly, they are self-documenting, collecting and preserving their institution's own Web presence. This is an important institutional function, but it contributes little to scholarship. There are, of course, many honorable exceptions, such as the Columbia University Human Rights Web Archive, and the New York Art Resources Consortium Web archive. But it is clear that the Web archiving priority for most University libraries is documenting their institution, further reducing the general utility of the limited funds and staff they devote to it.
Jefferson points out that:
If University libraries/archives spent 1% of their acquisitions budget on Web archiving, they could expand their preserved historical Web records by a multiple of 20x.
A 1% tax is small compared to the annual increase in subscription package prices:
Once again, we analyzed the rate of price increase for more than 8,600 e-journal packages handled by EBSCO Information Services. For 2018, the average rate of increase was in the 4.7% to 5.3% range, up slightly from the 4.5% to 4.9% in 2017.
That 1% would average $135,000 per ARL library, an increase of 11 times. Jefferson's 20x reflects two further effects: the increase would fund less fragmented, and thus more productive, staff, and it would have a bigger effect on smaller libraries. The ARL libraries alone would spend a total of $15,525,000, or 4.5 times as much as IA spends. Even assuming no increase in cost-effectiveness, the ARL libraries would be collecting more than a Petabyte a year. If they spent even 0.1% of the acquisitions budget it would still have a big impact.
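The effect of the 1% tax can be worked through the same way. This is a rough sketch using the numbers quoted in the post; it assumes the ingest table is in TB and that ingest scales linearly with spending, and the variable names are mine.

```python
# Inputs quoted in the post; the TB unit for ingest is an assumption.
ARL_LIBRARIES = 115
CURRENT_AVG_WEB_ARCHIVE_USD = 12_250   # Avg. ARL Web archive, $0.01225M
ONE_PERCENT_TAX_USD = 135_000          # 1% of average acquisitions budget
IA_WEB_ARCHIVING_USD = 3_400_000       # IA's (rounded) annual Web archiving spend
ARL_INGEST_TB = 95                     # Total ARL Archive-It ingest per year

# Per-library increase over current Web archiving spending.
increase = ONE_PERCENT_TAX_USD / CURRENT_AVG_WEB_ARCHIVE_USD

# Combined spending by all ARL libraries, and how it compares with IA.
arl_total_usd = ARL_LIBRARIES * ONE_PERCENT_TAX_USD
vs_ia = arl_total_usd / IA_WEB_ARCHIVING_USD

# No gain in cost-effectiveness: ingest grows in proportion to spending.
projected_ingest_tb = ARL_INGEST_TB * increase

print(f"Increase per library: {increase:.1f}x")
print(f"Total ARL spending:   ${arl_total_usd:,}")
print(f"Relative to IA:       {vs_ia:.2f}x")
print(f"Projected ingest:     {projected_ingest_tb:,.0f} TB/year")
```

With the rounded $3,400,000 figure the ratio to IA comes out a little above the 4.5 quoted, so the exact value depends on how IA's spending is rounded. The projected ingest only just passes 1,000 TB, which matches the "more than a Petabyte a year" estimate.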
Can we expect a diverse, inclusive, ethically-constructed archival record when spending 0.000x% of University budgets and 0.25 FTE on (mostly self) Web archiving?
Clearly, the answer is no. If libraries are to fulfill their role as society’s memory, they need to divert a small fraction of what they spend on last century’s media to collecting and preserving this century’s media.
It would be good to update the numbers in this post, but that will have to wait. There are more recent numbers on subscription spending here, but I'm told that their accuracy is disputed. Pointers to other sources would be welcome.