Tuesday, July 26, 2022

The Internet Archive's "Long Tail" Program

In 2018 I helped the Internet Archive get a two-year Mellon Foundation grant aimed at preserving the "long tail" of academic literature from small publishers, which is often at great risk of loss. In 2020 I wrote The Scholarly Record At The Internet Archive explaining the basic idea:
The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
Below the fold I report on subsequent developments in this project.

The first grant's results included, from the top-down part, the beta of the fatcat wiki search engine built on the collected metadata and, from the bottom-up part, an alpha version of a reasonably fast and accurate machine learning classifier that could identify academic articles.

This was enough progress that the Mellon Foundation awarded a second two-year grant, which has now concluded. One primary result was the launch of IA Scholar:
This service provides fulltext searching over research publications archived in Internet Archive's various collections. It includes content from the natural sciences, humanities, biomedicine, art, history, industrial research, government reports, and more.

Reader access to the content is provided when possible. Sometimes this access is to a "pre-print" or other version of the work, and this is indicated in the search results. In other cases, depending on search filters, results are included for which there is only a bibliographic catalog entry. It may still be possible to obtain access through a public library or from the publisher directly.
The difference between this and simple full-text search comes from the metadata index technology from the first grant. So, for example, you can:
  • copy and paste a citation into the search box, and the system will parse it.
  • use search filters such as year:<2000 or type:paper-conference
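As a minimal sketch of how such filtered queries could be composed programmatically, the snippet below joins free text with field:value filters and URL-encodes the result. The search URL and its q parameter are assumptions for illustration, not a documented API, and build_query and search_url are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter name; check the live service before relying on them.
SEARCH_URL = "https://scholar.archive.org/search"


def build_query(terms, filters):
    """Join free-text terms with field:value filters.

    `filters` is a list of (field, value) pairs so that a field such as
    `author` can appear more than once. Values containing spaces are quoted.
    """
    parts = [terms] if terms else []
    for field, value in filters:
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{field}:{value}")
    return " ".join(parts)


def search_url(query):
    """URL-encode the query string into the assumed `q` parameter."""
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


if __name__ == "__main__":
    q = build_query("digital preservation",
                    [("year", "<2000"), ("type", "paper-conference")])
    print(q)
    print(search_url(q))
```

A list of pairs is used rather than a dictionary so that the two-author query shown later in this post can be expressed with two author filters.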
Another was the Refcat dataset, announced in Internet Archive Releases Refcat, the IA Scholar Index of over 1.3 Billion Scholarly Citations:
As part of our ongoing efforts to archive and provide perpetual access to at-risk, open-access scholarship, we have released Refcat (“reference” + “catalog”), the citation index culled from the catalog that underpins our IA Scholar service for discovering the scholarly literature and research outputs within Internet Archive. This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts (articles, books, datasets, proceedings, code, etc) that IA Scholar has archived through web harvesting, digitization, integrations with other open knowledge services, and through partnerships and joint initiatives.
I tried IA Scholar. First, using Chromium, I pasted this citation into the search box:
David S.H. Rosenthal, and Daniel Vargas. “Distributed Digital Preservation in the Cloud”, International Digital Curation Conference, Amsterdam, Netherlands, January 2013
The system returned a single correct hit in 0.17 sec, with the start of the abstract and a DOI in the center column and a cornucopia of correct information in the right column. Both columns were somewhat marred by the raw HTML they included.

Second, I tried this query:
author:"david s h rosenthal" author:"Vicky Reich"
The system returned six hits in 0.59 sec, this time correctly formatted. The screenshot shows the top three results. Alas, in the second result the system has confused two papers with the same title:
  1. Vicky Reich & David S.H. Rosenthal. “LOCKSS (Lots Of Copies Keep Stuff Safe)”, Presented at Preservation 2000: An International Conference on the Preservation and Long Term Accessibility of Digital Materials, December 7-8, 2000, York, England. Also published in The New Review of Academic Librarianship, vol. 6, no. 1, 2000, pp. 155-161. doi:10.1080/13614530009516806
  2. David S.H. Rosenthal. “LOCKSS: Lots Of Copies Keep Stuff Safe”, Presented at the NIST Digital Preservation Interoperability Framework Workshop, March 29-31, 2010, Gaithersburg, MD.
The system correctly found paper #1 at The New Review of Academic Librarianship with the DOI. But it also found paper #2 in the Wayback Machine, collected from the LOCKSS Program's wiki. The titles and one of the authors matched, but the dates were a decade apart and the primary author of paper #1 was missing from paper #2, so matching needs some work. Perhaps greater weight should be given to the copyright dates. But to the system's credit, it found both a somewhat obscure formal paper and a really obscure conference paper that was simply posted to the project's wiki. And it didn't have any false positives.
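To illustrate what giving more weight to dates might look like, here is a minimal sketch of a matching rule that requires title, author and year agreement together. It is not the matcher IA Scholar uses. The Record type, the thresholds and the same_work function are all hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    title: str
    authors: frozenset  # lower-cased surnames (a simplifying assumption)
    year: int


def normalize(title):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in title)
    return " ".join(cleaned.split())


def same_work(a, b, min_title=0.9, min_author_overlap=0.5, max_year_gap=2):
    """Treat two records as the same work only if all three signals agree.

    The thresholds are illustrative guesses, not tuned values.
    """
    title_sim = SequenceMatcher(None, normalize(a.title), normalize(b.title)).ratio()
    union = a.authors | b.authors
    # Jaccard overlap of the two author sets
    overlap = len(a.authors & b.authors) / len(union) if union else 0.0
    year_gap = abs(a.year - b.year)
    return (title_sim >= min_title
            and overlap >= min_author_overlap
            and year_gap <= max_year_gap)


if __name__ == "__main__":
    paper1 = Record("LOCKSS (Lots Of Copies Keep Stuff Safe)",
                    frozenset({"reich", "rosenthal"}), 2000)
    paper2 = Record("LOCKSS: Lots Of Copies Keep Stuff Safe",
                    frozenset({"rosenthal"}), 2010)
    print(same_work(paper1, paper2))
```

With these two papers the normalized titles are identical and the author overlap is exactly 0.5, so under these assumed thresholds it is the ten-year gap alone that keeps them apart.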
