Below the fold I report on subsequent developments in this project.
The project takes two opposite but synergistic approaches:
- Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
- Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
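The two approaches can be sketched as a pair of loops feeding one index. This is only an illustration under stated assumptions, not the project's code: `Index`, `top_down`, `bottom_up` and the injected callables (`in_archive`, `fetch_live`, `is_article`, `extract_metadata`) are hypothetical names standing in for the Wayback Machine lookup, the live-Web crawl, the classifier and the metadata extractor.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Index:
    """Toy stand-in for the metadata index (hypothetical, for illustration)."""
    records: dict = field(default_factory=dict)

    def add(self, key: str, metadata: dict, location: str) -> None:
        self.records[key] = {"metadata": metadata, "location": location}


def top_down(
    entries: Iterable[dict],
    in_archive: Callable[[str], Optional[str]],
    fetch_live: Callable[[str], Optional[str]],
    index: Index,
) -> None:
    """Start from bibliographic metadata and look for a copy of each article."""
    for meta in entries:
        # First ask whether the archive already holds a copy.
        location = in_archive(meta["url"])
        if location is None:
            # If not, try to get it from the live Web.
            location = fetch_live(meta["url"])
        if location is not None:
            # Only index the metadata when a copy actually exists.
            index.add(meta["doi"], meta, location)


def bottom_up(
    pdfs: Iterable[str],
    is_article: Callable[[str], bool],
    extract_metadata: Callable[[str], dict],
    index: Index,
) -> None:
    """Start from archived PDFs and keep those that look like articles."""
    for location in pdfs:
        if is_article(location):
            meta = extract_metadata(location)
            # Fall back to the archive location when no DOI was extracted.
            index.add(meta.get("doi", location), meta, location)
```

Passing the lookups in as callables keeps the sketch independent of any particular service; real implementations would replace them with network calls and a trained classifier.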
The results from the top-down part included the beta of the fatcat wiki search engine based on the collected metadata, and from the bottom-up part an alpha version of a reasonably fast and accurate machine learning classifier that could identify academic articles.
This was enough progress that the Mellon Foundation awarded a second two-year grant, which has now concluded. One primary result was the launch of IA Scholar:
This service provides fulltext searching over research publications archived in Internet Archive's various collections. It includes content from the natural sciences, humanities, biomedicine, art, history, industrial research, government reports, and more.
Reader access to the content is provided when possible. Sometimes this access is to a "pre-print" or other version of the work, and this is indicated in the search results. In other cases, depending on search filters, results are included for which there is only a bibliographic catalog entry. It may still be possible to obtain access through a public library or from the publisher directly.
The difference between this and simple full-text search comes from the metadata index technology from the first grant. So, for example, you can:
- copy and paste a citation into the search box, and the system will parse it.
- use search filters such as year:<2000 or type:paper-conference
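Filters of this kind are just `name:value` tokens in the query string, so they are easy to compose programmatically. The sketch below is a guess at how one might do that, not a documented client: `build_query` and `search_url` are hypothetical helpers, and the endpoint `https://scholar.archive.org/search` with a `q` parameter is an assumption about the service's URL scheme.

```python
from typing import Iterable, Tuple
from urllib.parse import urlencode

# Assumed endpoint and parameter name; check the live service before relying on them.
SEARCH_ENDPOINT = "https://scholar.archive.org/search"


def build_query(
    terms: Iterable[str] = (),
    filters: Iterable[Tuple[str, str]] = (),
) -> str:
    """Join free-text terms and name:value filters into one query string."""
    parts = list(terms)
    for name, value in filters:
        # Quote values containing whitespace so they remain a single token.
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def search_url(query: str) -> str:
    """Percent-encode the query and attach it to the assumed endpoint."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"


if __name__ == "__main__":
    q = build_query(filters=[("year", "<2000"), ("type", "paper-conference")])
    print(q)
    print(search_url(q))
```

Filters are passed as a list of pairs rather than keyword arguments so that the same filter name can appear more than once in a query.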
As part of our ongoing efforts to archive and provide perpetual access to at-risk, open-access scholarship, we have released Refcat (“reference” + “catalog”), the citation index culled from the catalog that underpins our IA Scholar service for discovering the scholarly literature and research outputs within Internet Archive. This first release of the Refcat dataset contains over 1.3 billion citations extracted from over 60 million metadata records and over 120 million scholarly artifacts (articles, books, datasets, proceedings, code, etc) that IA Scholar has archived through web harvesting, digitization, integrations with other open knowledge services, and through partnerships and joint initiatives.
I tried IA Scholar. First, using Chromium, I pasted this citation into the search box:
David S.H. Rosenthal, and Daniel Vargas. “Distributed Digital Preservation in the Cloud”, International Digital Curation Conference, Amsterdam, Netherlands, January 2013
The system returned a single correct hit in 0.17 sec with the start of the abstract and a DOI in the center column and a cornucopia of correct information in the right column, both somewhat marred because they included much raw HTML.
Second, I tried this query:
author:"david s h rosenthal" author:"Vicky Reich"
The system returned six hits in 0.59 sec, this time correctly formatted. The screenshot shows the top three results. Alas, the second has been confused by two papers with the same title:
- Vicky Reich & David S.H. Rosenthal. “LOCKSS (Lots Of Copies Keep Stuff Safe)”, Presented at Preservation 2000: An International Conference on the Preservation and Long Term Accessibility of Digital Materials, December 7-8, 2000, York, England. Also published in The New Review of Academic Librarianship, vol. 6, no. 1, 2000, pp. 155-161. doi:10.1080/13614530009516806
- David S.H. Rosenthal. “LOCKSS: Lots Of Copies Keep Stuff Safe”, Presented at the NIST Digital Preservation Interoperability Framework Workshop, March 29-31, 2010, Gaithersburg, MD.
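I don't know how IA Scholar actually matches records, but the following sketch shows one way a confusion like this can arise. It assumes, purely for illustration, a matcher that keys on a normalized title; `normalize_title`, `title_key` and `title_year_key` are hypothetical names. Once punctuation and case are stripped, the two titles above become identical, and only an extra field such as the year keeps them apart.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lower-case the title and collapse everything but letters and digits."""
    text = unicodedata.normalize("NFKD", title).casefold()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def title_key(record: dict) -> str:
    """Match key based on the title alone (prone to merging distinct works)."""
    return normalize_title(record["title"])


def title_year_key(record: dict) -> tuple:
    """Match key that also includes the year, separating same-titled works."""
    return (normalize_title(record["title"]), record["year"])


if __name__ == "__main__":
    # Titles and years taken from the two citations listed above.
    paper_2000 = {"title": "LOCKSS (Lots Of Copies Keep Stuff Safe)", "year": 2000}
    paper_2010 = {"title": "LOCKSS: Lots Of Copies Keep Stuff Safe", "year": 2010}

    print(title_key(paper_2000) == title_key(paper_2010))
    print(title_year_key(paper_2000) == title_year_key(paper_2010))
```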