Tuesday, May 2, 2017

Distill: Is This What Journals Should Look Like?

A month ago a post on the Y Combinator blog announced that they and Google had launched a new academic journal called Distill. This is no ordinary journal of slightly enhanced PDFs; it is a big step towards the way academic communication should work in the Web era:
The web has been around for almost 30 years. But you wouldn’t know it if you looked at most academic journals. They’re stuck in the early 1900s. PDFs are not an exciting form.

Distill is taking the web seriously. A Distill article (at least in its ideal, aspirational form) isn’t just a paper. It’s an interactive medium that lets users – “readers” is no longer sufficient – work directly with machine learning models.
Below the fold, I take a close look at one of the early articles to assess how big a step this is.

How to Use t-SNE Effectively is one of Distill's launch articles. It has a DOI - doi:10.23915/distill.00002. It can be cited like any other paper:
For attribution in academic contexts, please cite this work as

Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016. http://doi.org/10.23915/distill.00002

BibTeX citation

@article{wattenberg2016how,
  author = {Wattenberg, Martin and Viégas, Fernanda and Johnson, Ian},
  title = {How to Use t-SNE Effectively},
  journal = {Distill},
  year = {2016},
  url = {http://distill.pub/2016/misread-tsne},
  doi = {10.23915/distill.00002}
}
But this really isn't a conventional article:

Updates and Corrections

View all changes to this article since it was first published. If you see a mistake or want to suggest a change, please create an issue on GitHub.
The sub-head explains the article's goal:
Although extremely useful for visualizing high-dimensional data, t-SNE plots can sometimes be mysterious or misleading. By exploring how it behaves in simple cases, we can learn to use it more effectively.
Which is where it starts to look very different. It matches the goal set out in the blog post:
Ideally, such articles will integrate explanation, code, data, and interactive visualizations into a single environment. In such an environment, users can explore in ways impossible with traditional static media. They can change models, try out different hypotheses, and immediately see what happens. That will let them rapidly build their understanding in ways impossible in traditional static media.
And the article itself isn't static; it's more like a piece of open-source software:

Citations and Reuse

Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: “Figure from …”.
So far, so much better than a PDF, as you can see by visiting the article and playing with the examples, adjusting the sliders to see how the parameters affect the results.
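The key slider in the Wattenberg et al. article is perplexity, which sets the effective number of neighbors each point attends to. As a back-of-envelope illustration (my own stdlib-only sketch, not code from the article; real t-SNE actually binary-searches the bandwidth sigma for each point to hit the user's target perplexity), here is how sigma controls the perplexity of one point's Gaussian neighbor distribution:

```python
import math

def neighbor_perplexity(dists, sigma):
    # Gaussian neighbour weights p_j proportional to exp(-d_j^2 / (2 sigma^2)),
    # as in t-SNE's conditional distribution for one point
    w = [math.exp(-d * d / (2 * sigma * sigma)) for d in dists]
    z = sum(w)
    p = [x / z for x in w]
    # perplexity = 2 ** (Shannon entropy): the "effective neighbour count"
    h = -sum(q * math.log2(q) for q in p if q > 0)
    return 2 ** h

# one point with three near, three middling, and three far neighbours
dists = [1, 1, 1, 4, 4, 4, 9, 9, 9]
print(neighbor_perplexity(dists, 0.5))    # small sigma: ~3 effective neighbours
print(neighbor_perplexity(dists, 100.0))  # large sigma: nearly all 9 count
```

In the algorithm itself the user fixes the perplexity and each point's sigma is solved for, which is why moving the slider changes how local or global the resulting plot looks.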

But is this article as preservable as a PDF? You can try an interesting experiment. On your laptop, point your browser at the article, wait for it to load and show that the examples work. Now turn off WiFi, and the examples continue to work!

Using "View Source" you can see that the functionality of the article is implemented by a set of JavaScript files:
<script src="assets/d3.min.js"></script>
<script src="assets/tsne.js"></script>
<script src="assets/demo-configs.js"></script>
<script src="assets/figure-configs.js"></script>
<script src="assets/visualize.js"></script>
<script src="assets/figures.js"></script>
which are downloaded by your browser during page load and can be captured by a suitably configured crawler. So it is in principle as preservable as a PDF.
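To see what "suitably configured" means, note that the crawler must parse the HTML and fetch every script the page depends on, not just the page itself. A minimal stdlib-only sketch of that discovery step (my illustration, not any actual crawler's code):

```python
from html.parser import HTMLParser

class AssetCollector(HTMLParser):
    """Collect the script URLs a crawler must also capture
    for the page to keep working offline."""
    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and "src" in a:
            self.assets.append(a["src"])

page = '''<script src="assets/d3.min.js"></script>
<script src="assets/tsne.js"></script>'''
collector = AssetCollector()
collector.feed(page)
print(collector.assets)  # the JavaScript files to fetch alongside the page
```

A real crawler must also capture stylesheets, images, and anything the scripts themselves fetch at run time; missing that last category is a plausible way for a capture to end up only partly functional.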

The Wayback Machine's first capture of the article is only partly functional, probably because the Internet Archive's crawler wasn't configured to capture everything. Webrecorder.io collects a fully functional version.

Here is a brief look at some of the other articles now up at Distill:
  • Attention and Augmented Recurrent Neural Networks by Olah & Carter, and Deconvolution and Checkerboard Artifacts by Odena et al., both contain interactive diagrams illustrating the details of the algorithms they discuss. These again work when networking is disabled, so both appear to be preservable.
  • In Four Experiments in Handwriting with a Neural Network, Carter et al write:
    Neural networks are an extremely successful approach to machine learning, but it’s tricky to understand why they behave the way they do. This has sparked a lot of interest and effort around trying to understand and visualize them, which we think is so far just scratching the surface of what is possible.

    In this article we will try to push forward in this direction by taking a generative model of handwriting and visualizing it in a number of ways. The model is quite simple (so as to run well in the browser) so the generated output mostly produces gibberish letters and words (albeit, gibberish that look like real handwriting), but it is still useful for our purposes of exploring visualization techniques.
    Thus, like the Wattenberg et al. article, this paper actually contains an implementation of the algorithm it discusses. In this case it is a model derived by machine learning, albeit one simple enough to run in the browser. Again, you can disable networking and show that the article's model and animations remain fully functional.
  • Why Momentum Really Works by Gabriel Goh is similar, in that its interactive diagrams are powered by an implementation of the optimization technique it describes, which is again functional in the absence of network connectivity.
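The technique Goh's diagrams animate is simple enough to state in a few lines. Here is a hedged, stdlib-only sketch of classical momentum (my own, not the article's code; the learning rate, momentum coefficient, and quadratic objective are illustrative choices):

```python
def minimize(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Gradient descent with momentum:
    v <- beta * v + grad(x);  x <- x - lr * v."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)
        x = x - lr * v
    return x

# minimise f(x) = (x - 3)**2, whose gradient is 2 * (x - 3)
print(minimize(lambda x: 2 * (x - 3), x0=0.0))  # converges to ~3.0
```

The velocity `v` accumulates past gradients, which is what lets momentum power through shallow ravines; setting `beta = 0` reduces it to plain gradient descent, and the article's interactive diagrams let the user vary exactly these knobs.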
Clearly, Distill articles are a powerful way to communicate and explain the behavior of algorithms for machine learning. But there are still issues. Among the non-technical issues are:
  • Since Distill articles are selected via pre-publication peer review (they are also subject to post-publication review via GitHub), each needs a private GitHub repository during review, a cost presumably borne by the authors. But there don't appear to be any article processing charges (APCs).
  • There is a separate Distill Prize for "outstanding work communicating and refining ideas" with an initial endowment of $125,000:
    The Distill Prize has a $125,000 USD initial endowment, funded by Chris Olah, Greg Brockman, Jeff Dean, DeepMind, and the Open Philanthropy Project. Logistics for the prize are handled by the Open Philanthropy Project.
  • The prize endowment does not cover the journal itself, and it isn't clear how its costs will be met. Distill is open access, does not appear to levy APCs, and the site carries no information about funding. The journal is presumably sponsored to some extent by the deep pockets of Google and Y Combinator, which could raise issues of editorial independence.
  • The costs of running the journal will be significant. There are the normal costs of the editorial and review processes, and the running costs of the Web site. In addition, the interactive graphics are of extremely high quality, due presumably not to graphic design talent among the authors but to Distill's user interface design support:
    Distill provides expert editing to help authors improve their writing and diagrams.
    The editors' job is presumably made easier by the suite of tools provided to authors, but this expertise also costs money.
  • Distill does appear committed to open access to research. Attention and Augmented Recurrent Neural Networks has 21 references. An example is a paper published in the Journal of Machine Learning Research as Proceedings of the 33rd International Conference on Machine Learning. It appears as:
    Ask Me Anything: Dynamic Memory Networks for Natural Language Processing[PDF]
    Kumar, A., Irsoy, O., Su, J., Bradbury, J., English, R., Pierce, B., Ondruska, P., Gulrajani, I. and Socher, R., 2015. CoRR, Vol abs/1506.07285.
    citing and linking to the Computing Research Repository at arxiv.org rather than the journal (JMLR doesn't appear to have DOIs via which to link). 14 of the article's 21 references similarly link to the open access version at arxiv.org rather than to the actual publication (even when that publication is itself open access), and another two are to preprints. Presumably this is an editorial policy.
Among the technical issues are:
  • Distill is a journal about machine learning, which is notoriously expensive in computation. There are limits to how much computation can feasibly be extracted from a browser, so there will clearly be topics for which an appropriate presentation requires significant support from data-center resources. These presentations will not be preservable.
  • Machine learning also notoriously requires vast amounts of data. This is another reason why some articles will require data-center support, and will not be preservable.
  • GitHub is not a preservation repository, so the conversation around the articles will not automatically be preserved.
  • If articles need data-center support, there will be unpredictable on-going costs that somehow need to be covered, a similar problem to the costs involved in cloud-based emulation services.
  • The title of an article's GitHub repository looks like this:
    How to Use t-SNE Effectively http://distill.pub/2016/misread-tsne/
    but the link should go not directly to the distill.pub website but via the DOI https://doi.org/10.23915/distill.00002. This is yet another instance of the problem, discussed in Persistent URIs Must Be Used To Be Persistent by Herbert Van de Sompel et al., of publishers preferring direct links to DOI links, thus partially defeating the DOIs' purpose.
The interactive diagrams and examples that provide the pedagogic power of Distill are constrained by the limits of what can be implemented in a browser. Communicating about machine learning at even fairly small scale will be out of reach. For example, Distill obviously could not publish anything like End to End Learning for Self-Driving Cars by Bojarski et al., which describes how NVIDIA used one of their boxes to learn, and another to execute, a model capable of autonomously steering a car in traffic on the New Jersey Turnpike. The compute power of these boxes is far greater than is available from a browser. Even if data-center support were available, the browser still lacks a camera, steering, brakes, wheels, and an engine. Not to mention a professional self-driving car driver.

Note that Distill's use of GitHub is similar to the way the Journal of Open Source Software operates, but JOSS doesn't support execution of the software in the way Distill does, so does not require data-center support. Nor, since it covers only open source code, does it need private repositories.

In summary, given the limitations of the browser's execution environment, Distill does an excellent job of publishing articles that use interactivity to provide high-quality explanations and the ability to explore the parameter space. It does so without sacrificing preservability, which is important. It isn't clear how sustainable the journal will be, nor how much of the field of machine learning will be excluded by the medium's limitations.

1 comment:

David. said...

Another argument for programmability in presenting articles comes from Roli Roberts's An Unexpected Perk of our Data Policy in which he describes a colleague presenting multiple different visualizations of the data behind a graph in a PLOS Biology article:

"There are clearly many ways to skin a cat, as the unfortunate English adage has it, and perhaps one day all journals will present the data in an immediately implementable form; readers who like heatmaps could click “heatmap” and those who prefer 3D could click “3D,” etc., but at least the provision of the underlying data meant that this feline flaying was even a possibility. Long live open data!"