After two decades working on digital preservation in general, and archiving Web content in particular:
The most important lesson I've learned is that this is fundamentally an economic problem; we know how to do it but we don't want to pay enough to have it done.
Jefferson's talk provided a lot of data, shown in the tables, to back up that assertion.
Internet Archive Holdings | |
---|---|
Type | Number |
Software Titles | 25,000 |
Moving Images | 2,000,000 |
Scanned Books | 2,300,000 |
Audio Files | 2,400,000 |
Hours of TV | 3,000,000 |
eBooks | 4,000,000 |
URLs | 635,000,000,000 |
The Archive's recycled church in San Francisco and its second site nearby sustain about 40Gb/s outbound and 20Gb/s inbound, serving about 4M unique IPs/day. The Web collection alone serves over 800K users per day.
Archive-It Partner Types | |
---|---|
Archive-It Partners | Percent of total |
College/University Library | 59.34 |
Public Library | 7.33 |
Non-profit or NGO | 5.67 |
State Library/Archive | 5.67 |
Museum or Art Library | 3.78 |
National Govt. Agency | 3.55 |
State/Local Govt. | 3.55 |
Law Library | 2.36 |
Religious Organization | 2.36 |
National Library/Archive | 1.66 |
Historical Society | 1.42 |
K-12 School | 1.18 |
The significant participation of public libraries is relatively recent, the result of an IMLS- and IA-funded "Community Webs Project". It provides 27 public libraries with:
education, applied training, cohort network development, and web archiving services
The goal is to help public libraries build local history collections for the future.
Budgets of 115 ARL Libraries | |
---|---|
Annual Budget of | $M |
All ARL Libraries | 3,500 |
Avg. ARL Library | 30 |
Avg. Acquisitions | 13.5 |
Avg. Subscriptions | 8.4 |
Avg. ARL Web archive | 0.01225 |
Avg. EDU Web archive | 0.00675 |
Web Archiving vs. Budget | |
---|---|
Fraction of | Fraction |
Avg. ARL Acquisitions | 9×10⁻⁶ |
Total Acquisitions | 4×10⁻⁶ |
Subscriptions | 8×10⁻⁶ |
Internet Archive Budget | |
---|---|
Expense | $M |
Total | 16.0 |
Hardware | 2.4 |
Web Hardware | 1.2 |
Web Hardware per PB | 0.24 |
Institutional Budget | |
---|---|
Expense | $M |
IA Total | 16.0 |
IA Staff Salaries | 11.0 |
IA Web Staff Salaries | 2.2 |
Avg. ARL Total | 30.0 |
Avg. ARL Salaries | 5.8 |
Adding hardware and staff, the Internet Archive spends $3,400,000 each year on Web archiving, or 21% of its total budget. This is about 2.4 times as much as all the ARL libraries combined.
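The arithmetic behind these figures can be checked with a short sketch. It assumes the table values above are in millions of dollars, and that "all the ARL libraries combined" means the 115-library count multiplied by the average ARL Web archive budget; the variable names are illustrative only.

```python
# Figures in $M, copied from the tables above.
IA_TOTAL = 16.0
IA_WEB_HARDWARE = 1.2
IA_WEB_STAFF = 2.2
AVG_ARL_WEB_ARCHIVE = 0.01225
ARL_LIBRARY_COUNT = 115  # from the "Budgets of 115 ARL Libraries" table title

# IA's annual Web archiving spend: Web hardware plus Web staff salaries.
ia_web_spend = IA_WEB_HARDWARE + IA_WEB_STAFF

# Share of the Internet Archive's total budget.
ia_web_share = ia_web_spend / IA_TOTAL

# Assumption: combined ARL spend = number of libraries x average per library.
arl_web_spend = ARL_LIBRARY_COUNT * AVG_ARL_WEB_ARCHIVE

print(f"IA Web archiving spend: ${ia_web_spend:.1f}M")
print(f"Share of IA budget:     {ia_web_share:.0%}")
print(f"Combined ARL spend:     ${arl_web_spend:.2f}M")
print(f"IA / ARL ratio:         {ia_web_spend / arl_web_spend:.1f}x")
```

Under those assumptions the combined ARL figure comes to roughly $1.4M a year, which is where the ratio of about 2.4 comes from.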
According to the 2017 NDSA Web Archiving Survey, by far the most common staffing level for Web archiving among libraries is 0.25 FTE. Staff who spend most of their time on other work will not be very productive in their Web archiving.
Web Content Ingested Annually | |
---|---|
Source | TB |
Total ARL Archive-It (90) | 95 |
Total .edu Archive-It (275) | 160 |
Total non-.edu Archive-It (190) | 120 |
Total LoC + NARA | 325 |
Total Internet Archive | 5,120 |
What kind of content are libraries collecting from the Web? Overwhelmingly, they are self-documenting, collecting and preserving their institution's own Web presence. This is an important institutional function, but it contributes little to scholarship. There are, of course, many honorable exceptions, such as the Columbia University Human Rights Web Archive, and the New York Art Resources Consortium Web archive. But it is clear that the Web archiving priority for most University libraries is documenting their institution, further reducing the general utility of the limited funds and staff they devote to it.
Jefferson points out that:
If University libraries/archives spent 1% of their acquisitions budget on Web archiving, they could expand their preserved historical Web records by a multiple of 20x.
A 1% tax is small compared to the annual increase in subscription package prices:
Once again, we analyzed the rate of price increase for more than 8,600 e-journal packages handled by EBSCO Information Services. For 2018, the average rate of increase was in the 4.7% to 5.3% range, up slightly from the 4.5% to 4.9% in 2017.
That 1% would average $135,000 per ARL library, an increase of 11 times. Jefferson's 20x reflects both the less fragmented, and thus more productive, staff that the increase would fund and its bigger effect on smaller libraries. The ARL libraries alone would spend a total of $15,525,000, or 4.5 times as much as IA spends. Even assuming no increase in cost-effectiveness, the ARL libraries would be collecting more than a Petabyte a year. If they spent even 0.1% of the acquisitions budget it would still have a big impact.
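The 1% scenario can be reproduced from the table figures with a minimal sketch. The function name `acquisitions_tax` and the dictionary keys are hypothetical, and the ingest projection assumes that collected volume scales linearly with spending (the "no increase in cost-effectiveness" case), starting from the 95 TB a year reported for the 90 ARL Archive-It partners.

```python
# Dollar figures derived from the tables above ($M converted to $).
AVG_ARL_ACQUISITIONS = 13_500_000
AVG_ARL_WEB_ARCHIVE = 12_250
ARL_LIBRARY_COUNT = 115
IA_WEB_SPEND = 3_400_000
ARL_ARCHIVE_IT_TB = 95  # annual ingest of the 90 ARL Archive-It partners


def acquisitions_tax(rate):
    """Estimate the effect of spending `rate` of acquisitions on Web archiving."""
    per_library = rate * AVG_ARL_ACQUISITIONS
    multiple = per_library / AVG_ARL_WEB_ARCHIVE
    arl_total = per_library * ARL_LIBRARY_COUNT
    return {
        "per_library": per_library,       # dollars per ARL library
        "multiple_of_current": multiple,  # versus today's average spend
        "arl_total": arl_total,           # dollars across all ARL libraries
        "vs_ia": arl_total / IA_WEB_SPEND,
        # Assumption: ingest grows in proportion to spending.
        "projected_tb": ARL_ARCHIVE_IT_TB * multiple,
    }


for name, value in acquisitions_tax(0.01).items():
    print(f"{name}: {value:,.1f}")
```

Calling the function with other rates shows how sensitive the totals are to the chosen percentage.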
Jefferson asks:
Can we expect a diverse, inclusive, ethically-constructed archival record when spending 0.000x% of University budgets and 0.25 FTE on (mostly self) Web archiving?
Clearly, the answer is no. If libraries are to fulfill their role as society’s memory, they need to divert a small fraction of what they spend on last century’s media to collecting and preserving this century’s media.
It would be good to update the numbers in this post, but that will have to wait. There are more recent numbers on subscription spending here, but I'm told that their accuracy is disputed. Pointers to other sources would be welcome.
Thanks for this, and for the shout out to our Human Rights Web Archive. I think the greatest challenge to seeing a regularized use of web collecting as a part of collection development is cultural, not technical. We need to step outside of the "archives" frameworks that are still dominant in web archiving and into the "collection development" framework. I was at the EAW presentation and commented that we aren't competing for collection development dollars, since we're not in the same ring, but we are competing with digital library, digital humanities and digital preservation programs, which are all drawing on the skill sets and infrastructure that are needed for web collecting. Not to mention the investment we need to make in creating better access and more meaningful research experiences using web archives. Nonetheless this is a very helpful exercise in putting things into perspective.
Forgot to sign my name to the above comment -- Pamela Graham, Columbia University.
There are several reasons to slow down and consider the digital world. Technology moves so rapidly that content management systems, digital formats, and catalogs have a lifespan of possibly a decade before most of the foundation needs to be completely redone, because the technology is no longer used, i.e. shtml, music, flash, etc. Digitizing and saving .tifs as uncompressed images has a far better chance of having a longer shelf life than jpeg2000, pdf, and jpg, which use compression. Even the Internet Archive came to the conclusion that paper was a solid solution for backup. And there is shifting the paradigm of microfilm, utilizing it as a "storage" medium, transferring high resolution color images to microfilm, providing a high resolution, in color. Digitizing is a money pit, always requiring money to either upgrade or migrate. NARA has 15 billion records, and by old service bureau prices at .15 per scan, it will still cost $9 trillion dollars just to scan records. And that doesn't even count the cost of building a content management system that is user friendly. In 2006-7 the ERA system cost over half a million dollars. Is Congress actually ever going to provide funding for digitizing, when they aren't even properly funding the physical Archives? Digitizing is like asking Congress to build an exact replica of the Grand Canyon on the East Coast, constantly making sure that it remains an exact replica. For a third of the money to digitize, they could build a large new East Coast regional facility and hire additional archival and preservation staff to do the job of inventorying and flattening the records. There are independent researchers to digitize records that people need. Karen Needles, Lincoln Archives Digital Project