Thursday, May 2, 2019

Lets Put Our Money Where Our Ethics Are

I found a video of Jefferson Bailey's talk at the Ethics of Archiving the Web conference from a year ago. It was entitled Lets Put Our Money Where Our Ethics Are. The talk is the first 18.5 minutes of this video. It focused on the paucity of resources devoted to archiving the huge proportion of our culture that now lives on the evanescent Web. I've also written on this topic, for example in Pt. 2 of The Amnesiac Civilization. Below the fold, some detailed numbers (that may by now be somewhat out-of-date) and their implications.

After two decades working on digital preservation in general, and archiving Web content in particular:
The most important lesson I've learned is that this is fundamentally an economic problem; we know how to do it but we don't want to pay enough to have it done.
Jefferson's talk provided a lot of data, shown in the tables, to back up that assertion.

Internet Archive Holdings
Type Number
Software Titles 25,000
Moving Images 2,000,000
Scanned Books 2,300,000
Audio Files 2,400,000
Hours of TV 3,000,000
eBooks 4,000,000
URLs 635,000,000,000
The Internet Archive is a global resource, comparable to, and with a much bigger audience than, national libraries. For many years it has ranked among the top 300 Web sites in the world; for comparison, the Library of Congress typically ranks between 4000 and 6000. The Archive has well over 100PB of storage in total and over 40PB of unique data, of which about half is Web data. The table summarizes the categories into which its holdings are organized.

The Archive's recycled church in San Francisco and its second site nearby sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. The Web collection alone serves over 800K users per day.

The Archive's Web content comes from many sources, including their own global crawls, domain crawls for national libraries, partnerships with search engines, and crowd-sourcing from "Save page now", but Jefferson's talk focused on the curated collections ingested via their Archive-It service (Ars Technica's Nathan Mattise has the big picture view from the Wayback Machine's Mark Graham).

Archive-It Partner Types
Archive-It Partners Percent of total
College/University Library 59.34
Public Library 7.33
Non-profit or NGO 5.67
State Library/Archive 5.67
Museum or Art Library 3.78
National Govt. Agency 3.55
State/Local Govt. 3.55
Law Library 2.36
Religious Organization 2.36
National Library/Archive 1.66
Historical Society 1.42
K-12 School 1.18
Archive-It serves about 600 partners of many different types, as shown in the table. But about 60% of them are college or University libraries or archives, which is where Jefferson's talk focused.

The significant participation of public libraries is relatively recent, the result of an IMLS- and IA-funded "Community Webs Project". It provides 27 public libraries with:
education, applied training, cohort network development, and web archiving services
The goal is to help public libraries build local history collections for the future.

Budgets of 115 ARL Libraries
Annual Budget of $M
All ARL Libraries 3,500
Avg. ARL Library 30
Avg. Acquisitions 13.5
Avg. Subscriptions 8.4
Avg. ARL Web archive 0.01225
Avg. EDU Web archive 0.00675
This table starts Jefferson's comparison of the annual budgets of ARL libraries (major US University libraries) with the IA's. Acquisitions is their total spend on acquiring content; subscriptions is the part they spend on academic journals and databases. Compare these with the amounts the average ARL and non-ARL University library spends on Web archiving.

Web Archiving vs. Budget
Fraction of Fraction
Avg. ARL Acquisitions 9*10^-6
Total Acquisitions 4*10^-6
Subscriptions 8*10^-6
This table shows the fractions that Web archiving represents of the average ARL library's total spending, spending on acquisitions, and spending on subscriptions. They are all less than one one-hundred-thousandth, which is rather a small amount.

Internet Archive Budget
Expense $M
Total 16.0
Hardware 2.4
Web Hardware 1.2
Web Hardware per PB 0.24
For comparison, this table details the Internet Archive budget. It is about half the budget of the average ARL library, and about 3% of the budget of the Library of Congress, despite the Archive serving a vastly greater audience than any of them. Web archiving consumes about half the storage, and so about half the hardware spending.

Institutional Budget
Expense $M
IA Total 16.0
IA Staff Salaries 11.0
IA Web Staff Salaries 2.2
Avg. ARL Total 30.0
Avg. ARL Salaries 5.8
Of course, much of each institution's spending goes to staff. Web archiving consumes only a fifth of IA's staff spending. This is to be expected, because activities like book scanning, film digitization, and so on are much less automated.

Adding hardware and staff, the Internet Archive spends $3,400,000 each year on Web archiving, or 21% of its total budget. This is about 2.4 times as much as all the ARL libraries combined.
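The arithmetic behind that paragraph can be sketched as follows. This is a minimal check using the figures from the tables above; it assumes that the combined ARL figure is simply the 115 libraries times the average ARL Web archive budget, which the talk may have computed differently.

```python
# Figures in $M, copied from the tables in this post.
IA_TOTAL = 16.0
IA_WEB_HARDWARE = 1.2
IA_WEB_STAFF = 2.2
ARL_LIBRARIES = 115            # number of ARL libraries in the budget table
AVG_ARL_WEB_ARCHIVE = 0.01225  # average ARL spend on Web archiving

# IA's annual Web archiving spend is Web hardware plus Web staff.
ia_web_spend = IA_WEB_HARDWARE + IA_WEB_STAFF
ia_web_share = ia_web_spend / IA_TOTAL

# Assumption: combined ARL spend = number of libraries * average spend.
arl_web_total = ARL_LIBRARIES * AVG_ARL_WEB_ARCHIVE
ratio = ia_web_spend / arl_web_total

print(f"IA Web archiving spend: ${ia_web_spend:.1f}M ({ia_web_share:.0%} of total)")
print(f"All ARL libraries combined: ${arl_web_total:.2f}M")
print(f"IA spends {ratio:.1f}x the ARL total")
```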

According to the 2017 NDSA Web Archiving Survey, by far the most common staffing level for Web archiving among libraries is 0.25 FTE. Staff whose majority time commitment is elsewhere will not be very productive in their Web archiving.

Web Content Ingested Annually
Source TB
Total ARL Archive-It (90) 95
Total .edu Archive-It (275) 160
Total Archive-It (190) 120
Total LoC + NARA 325
Total Internet Archive 5,120
This table shows how much data the libraries get for their money. The Archive-It partners ingest each year about 280TB. Archive-It is 5.5% (280/5120) of the bytes the Internet Archive ingests from the Web each year.
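As a quick check of the 5.5% figure, here is a minimal sketch. It assumes that the 280TB Archive-It total is the sum of the 160TB and 120TB rows in the table; the variable names are illustrative only.

```python
# Annual ingest in TB, copied from the table above.
ARCHIVE_IT_EDU_TB = 160    # "Total .edu Archive-It" row
ARCHIVE_IT_OTHER_TB = 120  # "Total Archive-It (190)" row
IA_TOTAL_TB = 5120         # "Total Internet Archive" row

# Assumption: the 280TB quoted in the text is the sum of the two rows above.
archive_it_tb = ARCHIVE_IT_EDU_TB + ARCHIVE_IT_OTHER_TB
archive_it_share = archive_it_tb / IA_TOTAL_TB

print(f"Archive-It ingest: {archive_it_tb}TB/year")
print(f"Share of IA's annual Web ingest: {archive_it_share:.1%}")
```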

What kind of content are libraries collecting from the Web? Overwhelmingly, they are self-documenting, collecting and preserving their institution's own Web presence. This is an important institutional function, but it contributes little to scholarship. There are, of course, many honorable exceptions, such as the Columbia University Human Rights Web Archive, and the New York Art Resources Consortium Web archive. But it is clear that the Web archiving priority for most University libraries is documenting their institution, further reducing the general utility of the limited funds and staff they devote to it.

Jefferson points out that:
If University libraries/archives spent 1% of their acquisitions budget on Web archiving, they could expand their preserved historical Web records by a multiple of 20x.
A 1% tax is small compared to the annual increase in subscription package prices:
Once again, we analyzed the rate of price increase for more than 8,600 e-journal packages handled by EBSCO Information Services. For 2018, the average rate of increase was in the 4.7% to 5.3% range, up slightly from the 4.5% to 4.9% in 2017.
That 1% would average $135,000 per ARL library, an increase of 11 times. Jefferson's 20x accounts for the fact that the increase would fund less fragmented, and thus more productive, staff, and for the bigger effect on smaller libraries. The ARL libraries alone would spend a total of $15,525,000, or 4.5 times as much as IA spends. Even assuming no increase in cost-effectiveness, the ARL libraries would be collecting more than a Petabyte a year. If they spent even 0.1% of the acquisitions budget it would still have a big impact.
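The numbers in that paragraph follow from the tables above. A minimal sketch, assuming (as the text does) no change in cost-effectiveness, so that the data ingested scales linearly with spending:

```python
# Figures in dollars and TB, copied from the tables in this post.
AVG_ACQUISITIONS = 13_500_000  # average ARL acquisitions budget
AVG_ARL_WEB_ARCHIVE = 12_250   # average ARL spend on Web archiving
ARL_LIBRARIES = 115
IA_WEB_SPEND = 3_400_000       # IA hardware plus staff for Web archiving
ARL_INGEST_TB = 95             # current ARL Archive-It ingest per year

# 1% of the average acquisitions budget, per library and in total.
per_library = AVG_ACQUISITIONS // 100
increase = per_library / AVG_ARL_WEB_ARCHIVE
arl_total = per_library * ARL_LIBRARIES
vs_ia = arl_total / IA_WEB_SPEND

# Assumption: ingest grows in proportion to spending.
projected_tb = ARL_INGEST_TB * increase

print(f"1% of acquisitions: ${per_library:,} per library, {increase:.0f}x current spend")
print(f"All ARL libraries: ${arl_total:,}, {vs_ia:.2f}x IA's Web archiving spend")
print(f"Projected ARL ingest: {projected_tb:,.0f}TB/year")
```

The projected ingest comes out a little over 1,000TB, which is where "more than a Petabyte a year" comes from.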

Jefferson asks:
Can we expect a diverse, inclusive, ethically-constructed archival record when spending 0.000x% of University budgets and 0.25 FTE on (mostly self) Web archiving?
Clearly, the answer is no. If libraries are to fulfill their role as society’s memory, they need to divert a small fraction of what they spend on last century’s media to collecting and preserving this century’s media.

It would be good to update the numbers in this post, but that will have to wait. There are more recent numbers on subscription spending here, but I'm told that their accuracy is disputed. Pointers to other sources would be welcome.


Unknown said...

Thanks for this, and for the shout out to our Human Rights Web Archive. I think the greatest challenge to seeing a regularized use of web collecting as a part of collection development is cultural, not technical. We need to step outside of the "archives" frameworks that are still dominant in web archiving and into the "collection development" framework. I was at the EAW presentation and commented that we aren't competing for collection development dollars, since we're not in the same ring, but we are competing with digital library, digital humanities and digital preservation programs, which are all drawing on the skill sets and infrastructure that are needed for web collecting. Not to mention the investment we need to make in creating better access and more meaningful research experiences using web archives. Nonetheless this is a very helpful exercise in putting things into perspective.

Unknown said...

Forgot to sign my name to the above comment -- Pamela Graham, Columbia University.

Lincolnarchives said...

There are several reasons to slow down and consider the digital world. Technology moves so rapidly that content management systems, digital formats, and catalogs have a lifespan of possibly a decade before most of the foundation needs to be completely redone, because the technology is no longer used, e.g. shtml, music, flash, etc. Digitizing and saving .tifs as uncompressed images has a far better chance of a longer shelf life than jpeg2000, pdf, and jpg, which use compression. Even the Internet Archive came to the conclusion that paper was a solid solution for backup. And there is shifting the paradigm of microfilm, utilizing it as a "storage" medium by transferring high resolution color images to microfilm. Digitizing is a money pit, always requiring money to either upgrade or migrate. NARA has 15 billion records, and by old service bureau prices at .15 per scan, it will still cost $9 trillion just to scan records. And that doesn't even count the cost of building a content management system that is user friendly. In 2006-7 the ERA system cost over half a million dollars. Is Congress actually ever going to provide funding for digitizing, when they aren't even properly funding the physical Archives? Digitizing is like asking Congress to build an exact replica of the Grand Canyon on the East Coast, constantly making sure that it remains an exact replica. For a third of the money to digitize, they could build a large new East Coast regional facility and hire additional archival and preservation staff to do the job of inventorying and flattening the records. There are independent researchers to digitize records that people need. Karen Needles, Lincoln Archives Digital Project