The project takes two opposite but synergistic approaches:
- Top-down: Using the bibliographic metadata from sources like CrossRef to ask whether each article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
- Bottom-up: Asking whether each of the PDFs in the Wayback Machine is an academic article and, if so, extracting the bibliographic metadata and adding it to an index.
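The top-down check can be sketched against the Wayback Machine's public availability API. The endpoint and the shape of its JSON response are real; the helper names and the canned example are mine, a minimal illustration rather than the project's code:

```python
import json
from urllib.parse import urlencode

AVAILABILITY_API = "https://archive.org/wayback/available"

def availability_query(article_url: str) -> str:
    """Build the query URL for the Wayback Machine availability API."""
    return AVAILABILITY_API + "?" + urlencode({"url": article_url})

def closest_snapshot(response_json: str):
    """Return the URL of the closest archived snapshot, or None if absent."""
    snap = json.loads(response_json).get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# A canned response in the shape the API returns:
canned = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20200101000000/https://example.com/paper.pdf",
            "timestamp": "20200101000000",
            "status": "200",
        }
    }
})
```

If `closest_snapshot` returns None, the top-down pipeline would attempt to fetch the article from the live Web instead.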
Top-down

The top-down approach is obviously the easier of the two, and has already resulted in fatcat.wiki, a fascinating search engine now in beta:
Fatcat is a versioned, user-editable catalog of research publications including journal articles, conference proceedings, and datasets. Features include archival file-level metadata (verified digests and long-term copies), an open, documented API, and work/release indexing (eg, distinguishing between and linking pre-prints, manuscripts, and version-of-record).

In contrast to the Wayback Machine's URL search, Fatcat allows searches for:
- A Release, a published version of a Work, for example a journal article, a pre-print, or a book. Search terms include a DOI.
- A Container, such as a journal or a serial. Search terms include an ISSN.
- A Creator, such as an author. Search terms include an author name or an ORCID.
- A File, based on a hash.
- A File Set, such as a dataset.
- A Work, such as an article, its pre-print, its datasets, etc.
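File-level lookup rests on cryptographic digests of the archived files. Computing such a digest is standard; this short sketch (the function name is mine) shows the chunked hashing that avoids loading a large PDF into memory:

```python
import hashlib

def file_digest(path: str, algorithm: str = "sha1") -> str:
    """Compute the hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```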
For example, a Creator search for "David Rosenthal" turns up, among others:
- Articles from 1985 and 2011 co-authored by David S. Rosenthal, an Australian oncologist.
- A 1955 paper on pneumonia by S. David Sternberg and Joseph H. Rosenthal.
- Papers from 1970, 1971 and 1972 whose authors include H. David and S. Rosenthal.
- Psychology papers from 1968, 1975 and 1976 co-authored by David Rosenthal with no middle initial.
That is where the "wiki" part of fatcat.wiki comes in. Fatcat is user-editable, able to support Wikipedia-style crowd-sourcing of bibliographic metadata. It is too soon to tell how effective this will be, but it is an attempt to address one of the real problems of the "long tail": that the metadata is patchy and low-quality.
Bottom-up

The bottom-up approach is much harder. At the start of the project the Internet Archive estimated that the Wayback Machine held perhaps 650M PDFs, of which about 6% (roughly 40M) were academic papers. Whether a PDF is an academic paper is only sometimes obvious, even to humans, and the PDFs that aren't obvious likely include much of the academic literature at significant risk of loss.
The grant supported Paul Baclace to use machine learning techniques to build a classifier capable of distinguishing academic articles from other PDFs in the Internet Archive's collection. As he describes in Making A Production Classifier Ensemble, he faced two big problems. First, cost:
Internet Archive presented me with this situation in early 2019: it takes 30 seconds to determine whether a PDF is a research work, but it needs to be less than 1 second. It will be used on PDFs already archived and on PDFs as they are found during a Web crawl. Usage will be batch, not interactive, and it must work without GPUs. Importantly, it must work with multiple languages for which there are not many training examples.

Second, accuracy based on the available training sets:
The positive training set is a known collection of 18 million papers which can be assumed to be clean, ground truth. The negative training set is simply a random selection from the 150 million PDFs which is polluted with positive cases. An audit of 1,000 negative examples showed 6–7% positive cases. These were cleaned up by manual inspection and bootstrapping was used to increase the size of the set.

Since some PDFs contained an image of the text rather than the text itself, Baclace started by creating an image of the first page:
The image produced is actually a thumbnail. This works fine because plenty of successful image classification is done on small images all the time. This has a side benefit that the result is language independent: at 224x224, it is not possible to discern the language used in the document.

The image was run through an image classifier. Then the text was extracted:
After much experimentation, the pdftotext utility was found to be fast and efficient for text. Moreover, because it is run as a subprocess, a timeout is used to avoid getting stuck on parasitic cases. When exposing a system to millions of examples, bad apple PDFs which cause the process to be slow or stuck are possible.

How good did the text classification have to be?
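The subprocess-with-timeout pattern in the quote can be sketched as follows. This is a minimal illustration, not the project's code; the wrapper name and the 5-second timeout are my assumptions, and `pdftotext` with `-` as the output argument writes to stdout:

```python
import subprocess

def pdf_to_text(path: str, timeout_s: float = 5.0):
    """Extract text from a PDF with the pdftotext utility.
    Returns None on timeout, missing binary, or extraction failure,
    so one bad-apple PDF cannot stall a batch run."""
    try:
        result = subprocess.run(
            ["pdftotext", path, "-"],   # "-" sends the text to stdout
            capture_output=True,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None                      # stuck PDF, or pdftotext not installed
    if result.returncode != 0:
        return None                      # pdftotext could not parse this file
    return result.stdout.decode("utf-8", errors="replace")
```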
The classic computer science paper has the words “abstract” and “references”. How well would it work to simply require both of these words? I performed informal measurements and found that requiring both words had only 50% accuracy, which is dismal. Requiring either keyword has 84% accuracy for the positive case. Only 10% of the negative cases had either of these keywords. This gave me an accuracy target to beat.

Baclace used two text classifiers. First, fastText from Facebook:
It’s a well-written C++ package with python bindings that makes no compromises when it comes to speed. “Fast” means 1msec., which is 2 orders of magnitude faster than a deep neural network on a CPU. The accuracy on a 20,000 document training and test set reached 96%. One of the big advantages is that it is full text, but a big disadvantage is that this kind of bag-of-words classifier does no generalization to languages with little or no training examples.

Second, BERT:
By now, everyone knows that the BERT model was a big leap ahead for “self-supervised” natural language. The multilingual model was trained with over 100 languages and it is a perfect choice for this project.
Using the classifier mode for BERT, the model was fine-tuned on a 20,000 document training and test set. The result was 98% accuracy.
Each model described above computes a classification confidence. This makes it possible to create an ensemble classifier that selectively uses the models to emphasize speed or accuracy. For a speed example, if text is available and the fastText linear classifier confidence is high, BERT could be skipped. To emphasize accuracy, all three models can be run and then the combined confidence values and classification predictions make an overall classification.

The result is a fast and reasonably accurate classifier. Its output can be fed to standard tools for extracting bibliographic metadata from the PDFs of academic articles, and the results added to Fatcat's index.
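The early-exit ensemble strategy in the quote can be sketched like this. It is a toy illustration under my own assumptions (the confidence threshold, the model interfaces, and the confidence-weighted vote are all mine), not the project's implementation:

```python
def classify(doc, image_model, fasttext_model, bert_model, confident=0.95):
    """Ensemble with early exit: each model returns (is_paper, confidence).
    When fastText is confident, the expensive BERT pass is skipped."""
    votes = [image_model(doc)]
    if doc.get("text"):
        label, conf = fasttext_model(doc)
        votes.append((label, conf))
        if conf >= confident:
            return label                 # fastText is sure: skip BERT
        votes.append(bert_model(doc))
    # Accuracy mode: confidence-weighted vote over the models that ran
    score = sum(c if is_paper else -c for is_paper, c in votes)
    return score > 0

# Toy stand-ins for the three models, for illustration only
img = lambda d: (True, 0.6)
ft = lambda d: (True, 0.99)
bert = lambda d: (False, 0.7)

doc = {"text": "abstract ..."}
```

In speed mode a confident fastText verdict ends the computation after roughly a millisecond; in accuracy mode all three confidences are combined.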
Synergy

Now the Internet Archive has two independent ways to find open access academic articles in the "long tail":
- by looking for them where the metadata says they should be,
- and by looking at the content collected by their normal Web crawling and by use of "Save Page Now".
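The two streams feed one index, so their results must be deduplicated; file digests (as used by Fatcat's File entities) are a natural key. A toy sketch, with field names and function name of my own invention:

```python
def merge_catalog(top_down, bottom_up):
    """Merge records found via metadata (top-down) with records found by
    classifying crawled PDFs (bottom-up), deduplicating on file digest.
    The first record seen for a digest wins, so richer metadata-derived
    records should be passed first."""
    index = {}
    for record in list(top_down) + list(bottom_up):
        index.setdefault(record["sha1"], record)
    return index

td = [{"sha1": "a" * 40, "doi": "10.1/x"}]          # found via CrossRef metadata
bu = [{"sha1": "a" * 40}, {"sha1": "b" * 40}]       # found by the classifier
merged = merge_catalog(td, bu)
```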