Thursday, July 25, 2019

Carl Malamud's Text Mining Project

For many years now it has been obvious that humans can no longer effectively process the enormous volume of academic publishing. The entire system is overloaded, and its signal-to-noise ratio is degrading. Journals are no longer effective gatekeepers; indeed, many are simply fraudulent. Peer review is incapable of preventing fraud, gross errors, false authorship, and duplicative papers; reviewers cannot be expected to have read all the relevant literature.

On the other hand, there is now much research showing that computers can be effective at processing this flood of information. Below the fold I look at a couple of recent developments.

An example of the power of text-mining the literature is this paper from the Lawrence Berkeley lab, Unsupervised Word Embeddings Capture Latent Knowledge from Materials Science Literature by Vahe Tshitoyan et al. (published July 4th; paywalled, but there is a viewable, no-download version of the paper here). From their abstract:
Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
Much of the previous research used supervised learning, training the system to recognize significance on human-collated sets of significant and insignificant papers. Tshitoyan et al. used unsupervised learning, with no human input as to significance, on a corpus of materials science papers. The key finding of the paper is that, despite the minimal human input, their system was capable of predicting future discoveries:
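To make the idea concrete, here is a toy sketch of the kind of search the paper describes: rank candidate materials by the cosine similarity of their word vectors to a "thermoelectric" context vector. The three-dimensional vectors below are invented purely for illustration; the paper's actual embeddings are word2vec vectors trained on millions of abstracts.

```python
import numpy as np

# Invented toy embeddings; real ones come from unsupervised training on abstracts.
embeddings = {
    "thermoelectric": np.array([0.9, 0.1, 0.3]),
    "SnSe":           np.array([0.8, 0.2, 0.4]),  # appears in similar contexts
    "NaCl":           np.array([0.1, 0.9, 0.0]),  # appears in dissimilar contexts
}

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank every material (other than the query word itself) by similarity
# to the "thermoelectric" vector, most similar first.
query = embeddings["thermoelectric"]
ranked = sorted(
    (w for w in embeddings if w != "thermoelectric"),
    key=lambda w: cosine(embeddings[w], query),
    reverse=True,
)
print(ranked)  # ['SnSe', 'NaCl']
```

The point of the technique is that a material can rank highly even if no abstract ever calls it a thermoelectric, because its vector is shaped by all the contexts it shares with known thermoelectrics.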
Figure 3a
Finally, we tested whether our model—if trained at various points in the past—would have correctly predicted thermoelectric materials reported later in the literature. Specifically, we generated 18 different ‘historical’ text corpora consisting only of abstracts published before cutoff years between 2001 and 2018. We trained separate word embeddings for each historical dataset, and used these embeddings to predict the top 50 thermoelectrics that were likely to be reported in future (test) years. For every year past the date of prediction, we tabulated the cumulative percentage of predicted thermoelectric compositions that were reported in the literature alongside a thermoelectric keyword. Figure 3a depicts the result from each such ‘historical’ dataset as a thin grey line. For example, the light grey line labelled ‘2015’ depicts the percentage of the top 50 predictions made using the model trained only on scientific abstracts published before 1 January 2015, and that were subsequently reported in the literature alongside a thermoelectric keyword after one, two, three or four years (that is, the years 2015–2018). Overall, our results indicate that materials from the top 50 word embedding-based predictions (red line) were on average eight times more likely to have been studied as thermoelectrics within the next five years as compared to a randomly chosen unstudied material from our corpus at that time (blue) and also three times more likely than a random material with a non-zero DFT bandgap (green). The use of larger corpora that incorporate data from more recent years improved the rate of successful predictions, as indicated by the steeper gradients for later years in Fig. 3a.
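The back-testing procedure the authors describe can be sketched as a simple loop: for each "historical" cutoff year, take the top-50 predictions from embeddings trained only on earlier abstracts, then tabulate the cumulative percentage confirmed in each later year. The function below is my own toy reconstruction of that tabulation step, not the authors' code, and the data are invented:

```python
def cumulative_hit_rate(top_predictions, reported_by_year, cutoff):
    """Cumulative % of predictions reported as thermoelectrics in each year >= cutoff."""
    hits, rates = set(), {}
    for year in sorted(y for y in reported_by_year if y >= cutoff):
        hits |= top_predictions & reported_by_year[year]
        rates[year] = 100.0 * len(hits) / len(top_predictions)
    return rates

# Toy data: four predictions, two later confirmed (all values invented).
top_predictions = {"SnSe", "CuGaTe2", "PbTe", "Bi2Te3"}
reported_by_year = {2015: {"SnSe"}, 2016: {"ZnO"}, 2017: {"CuGaTe2"}}
print(cumulative_hit_rate(top_predictions, reported_by_year, 2015))
# {2015: 25.0, 2016: 25.0, 2017: 50.0}
```

Each thin grey line in the paper's Figure 3a is one such curve, for one cutoff year.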
The corpus of materials science papers was created by downloading from Elsevier and Springer via their separate APIs. Leaving aside the cost and legal difficulties in accessing material this way, the publishers' stranglehold on the literature imposes additional work on researchers, and restricts input to papers they have acquired.

To be truly effective, text-mining needs a single API through which the text of all papers is accessible; a global database of the academic literature. Priyanka Pulla at Nature reports on Carl Malamud's significant step in that direction in The plan to mine the world’s research papers:
Over the past year, Malamud has — without asking publishers — teamed up with Indian researchers to build a gigantic store of text and images extracted from 73 million journal articles dating from 1847 up to the present day. The cache, which is still being created, will be kept on a 576-terabyte storage facility at Jawaharlal Nehru University (JNU) in New Delhi.
No one will be allowed to read or download work from the repository, because that would breach publishers’ copyright. Instead, Malamud envisages, researchers could crawl over its text and data with computer software, scanning through the world’s scientific literature to pull out insights without actually reading the text.
Tip of the hat to Glyn Moody, who explains the significance of India in this context:
India was chosen because of an important court battle that concluded two years ago. As Techdirt reported then, it is legal in India to make photocopies of copyright material in an educational context. Malamud's contention is that this allows him to mine academic material in India without the permission of publishers.
Whether text-mining is legal varies between countries, and even where it is theoretically legal faces practical obstacles. As Pulla reports:
Some countries have changed their laws to affirm that researchers on non-commercial projects don’t need a copyright-holder’s permission to mine whatever they can legally access. The United Kingdom passed such a law in 2014, and the European Union voted through a similar provision this year. That doesn’t help academics in poor nations who don’t have legal access to papers. And even in the United Kingdom, publishers can legally place ‘reasonable’ restrictions on the process, such as channelling scientists through publisher-specific interfaces and limiting the speed of electronic searching or bulk downloading to protect servers from overload. Such limits are a big problem, says John McNaught, deputy director of the National Centre for Text Mining at the University of Manchester, UK. “A limit of, say, one article every five seconds, which sounds fast for a human, is painfully slow for a machine. It would take a year to download around six million articles, and five years to download all published articles concerning just biomedicine,” he says.
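McNaught's arithmetic is easy to check: at one article every five seconds, six million articles take close to a year, and his five-year figure implies a biomedical literature of roughly thirty million articles.

```python
# Back-of-the-envelope check of the quoted rate-limit arithmetic.
SECONDS_PER_ARTICLE = 5
SECONDS_PER_DAY = 86_400

biomedicine_subset = 6_000_000  # articles, per the quote
days = biomedicine_subset * SECONDS_PER_ARTICLE / SECONDS_PER_DAY
print(f"{days:.0f} days")  # 347 days: about a year, as McNaught says
```

At that pace, mining a 73-million-article corpus through publisher interfaces would take over a decade, which is why bulk access matters.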
Note in particular that the 73M articles in Malamud's corpus greatly exceed the number of articles to which researchers, even at major research libraries, have "legal access".

Apparently, Malamud's corpus derives in some way from Sci-Hub. Malamud's eventual provision of text-mining access would be even more of a threat to the oligopoly publishers than Sci-Hub's provision of read access. Like Sci-Hub, it provides what users want: a single unified portal to the entire academic literature. But in addition:
  • It is arguably legal in many countries, which Sci-Hub isn't.
  • It enables the future of access to the academic literature, rather than the past.
Let's hope that, as with Sci-Hub, the publishers' response triggers the Streisand Effect.

But, as always, it is important to ask "What Could Possibly Go Wrong?". Jonathan Zittrain's The Hidden Costs of Automated Thinking focuses on the inability of these machine learning models to explain their outputs, and their vulnerability to adversarial inputs:
It’s easy to imagine that the availability of machine-learning-based knowledge will shift funding away from researchers who insist on the longer route of trying to figure things out for themselves. This past December, Mohammed AlQuraishi, a researcher who studies protein folding, wrote an essay exploring a recent development in his field: the creation of a machine-learning model that can predict protein folds far more accurately than human researchers. AlQuraishi found himself lamenting the loss of theory over data, even as he sought to reconcile himself to it. “There’s far less prestige associated with conceptual papers or papers that provide some new analytical insight,” he said, in an interview. As machines make discovery faster, people may come to see theoreticians as extraneous, superfluous, and hopelessly behind the times. Knowledge about a particular area will be less treasured than expertise in the creation of machine-learning models that produce answers on that subject.
Zittrain's whole argument is well worth a read.