Tuesday, April 14, 2015

The Maginot Paywall

Two recent papers examine the growth of peer-to-peer sharing of journal articles. Guillaume Cabanac's Bibliogifts in LibGen? A study of a text-sharing platform driven by biblioleaks and crowdsourcing (LG) is a statistical study of the Library Genesis service, and Carolyn Caffrey Gardner and Gabriel J. Gardner's Bypassing Interlibrary Loan via Twitter: An Exploration of #icanhazpdf Requests (TW) is a similar study of one of the sources for Library Genesis. Both services implement forms of the civil disobedience called for in Aaron Swartz's Guerilla Open Access Manifesto, a movement opposed to the malign effects of current copyright law on academic research. Below the fold, some thoughts on the state of this movement.

Friday, April 10, 2015

3D Flash - not as cheap as chips

Chris Mellor has an interesting piece at The Register pointing out that while 3D NAND flash may be dense, it's going to be expensive.

The reason is the enormous number of processing steps per wafer - between 96 and 144 deposition layers for the three leading 3D NAND flash technologies. Getting non-zero yields from that many steps involves huge investments in the fab (a toy yield calculation follows the capex figures below):
Samsung, SanDisk/Toshiba, and Micron/Intel have already announced +$18bn investment for 3D NAND.
  • Samsung’s new Xi’an, China, 3D NAND fab involves a +$7bn total capex outlay
  • Micron has outlined a $4bn spend to expand its Singapore Fab 10
This compares with Seagate and Western Digital’s capex totalling ~$4.3 bn over the past three years.
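Why the yield risk drives such large investments can be seen with a toy calculation: every layer has to succeed, so per-step yields compound multiplicatively. In this minimal Python sketch the 96 and 144 layer counts are from the article, but the per-step yields are hypothetical round numbers, purely for illustration:

    # Toy illustration: overall wafer yield is the product of per-step yields,
    # so even very good individual steps compound into significant losses.
    # Per-step yields below are hypothetical; only the layer counts are from the article.

    def overall_yield(per_step_yield: float, steps: int) -> float:
        """Yield if every one of `steps` steps must succeed independently."""
        return per_step_yield ** steps

    for steps in (96, 144):
        for per_step in (0.999, 0.99):
            print(f"{steps} layers at {per_step:.1%} per step -> "
                  f"{overall_yield(per_step, steps):.1%}")
    # 144 layers at 99.9% per step still yields ~87%; at 99% per step it collapses to ~24%.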
Chris has this chart, from Gartner and Stifel, comparing the annual capital expenditure per TB of storage of NAND flash and hard disk. Each TB of flash embodies at least 50 times as much capital expenditure as a TB of hard disk, which means it will be a lot more expensive to buy.

PS - "as cheap as chips" is a British usage.

Wednesday, April 8, 2015

Trying to fix the symptoms

In response to many complaints that the peer review process was so slow that it was impeding scientific progress, Nature Publishing Group announced that they would allow authors to pay to jump the queue:
As of 24th March 2015, a selection of authors submitting a biology manuscript to Scientific Reports will be able to opt-in to a fast-track peer-review service at an additional cost. Authors who opt-in to fast-track will receive an editorial decision (accept, reject or revise) with peer-review comments within three weeks of their manuscript passing initial quality checks. 
It is true that the review process is irritatingly slow, but this is a bad idea on many levels. Such a bad idea that an editorial board member resigned in protest. Below the fold I discuss some of the levels.

Sunday, April 5, 2015

The Mystery of the Missing Dataset

I was interviewed for an upcoming news article in Nature about the problem of link rot in scientific publications, based on the recent paper by Klein et al. in PLOS ONE. The paper is full of great statistical data but, as would be expected in a scientific paper, lacks the personal stories that would improve a news article.

I mentioned the interview over dinner with my step-daughter, who was featured in the very first post to this blog when she was a grad student. She immediately said that her current work is hamstrung by precisely the kind of link rot Klein et al investigated. She is frustrated because the dataset from a widely cited paper has vanished from the Web. Below the fold, a working post that I will update as the search for this dataset continues.

Wednesday, April 1, 2015

Preserving Long-Form Digital Humanities

Carl Straumsheim at Inside Higher Ed reports on a sorely-needed new Mellon Foundation initiative supporting digital publishing in the humanities:
The Andrew W. Mellon Foundation is aggressively funding efforts to support new forms of academic publishing, which researchers say could further legitimize digital scholarship.

The foundation in May sent university press directors a request for proposals to a new grant-making initiative for long-form digital publishing for the humanities. In the e-mail, the foundation noted the growing popularity of digital scholarship, which presented an “urgent and compelling” need for university presses to publish and make digital work available to readers.
Note in particular:
The foundation’s proposed solution is for groups of university presses to ... tackle any of the moving parts that task is comprised of, including “...(g) distribution; and (h) maintenance and preservation of digital content.”
Below the fold, some thoughts on this based on experience from the LOCKSS Program.

Tuesday, March 24, 2015

The Opposite Of LOCKSS

Jill Lepore's New Yorker "Cobweb" article has focused attention on the importance of the Internet Archive, and on the analogy with the Library of Alexandria; in particular, on the risks implicit in the fact that both represent single points of failure because they are so much larger than any other collection.

Typically, Jason Scott was first to respond with an outline proposal to back up the Internet Archive by greatly expanding the collaborative efforts of ArchiveTeam. I think Jason is trying to do something really important, and extremely difficult.

The Internet Archive's collection is currently around 15PB. It has doubled in size in about 30 months. Suppose it takes another 30 months to develop and deploy a solution at scale. We're talking crowd-sourcing a distributed backup of at least 30PB growing at least 3PB/year.

To get some idea of what this means, suppose we wanted to use Amazon's Glacier. This is, after all, exactly the kind of application Glacier is targeted at. As I predicted shortly after Glacier launched, Amazon has stuck with the 1c/GB/mo price, so in 2017 we'd be paying Amazon $3.6M a year just for the storage costs. Alternatively, suppose we used Backblaze's Storage Pod 4.5 at their current price of about 5c/GB; for each copy we'd have paid $1.5M in hardware cost and would be adding $150K worth per year. This ignores running costs and RAID overhead.
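To make those numbers explicit, here is a minimal back-of-envelope sketch in Python; the 30PB size, the 3PB/year growth and the two prices are the figures quoted above, and everything else is a round-number assumption:

    # Back-of-envelope check of the storage costs quoted above (decimal GB/PB).
    PB = 1_000_000                 # GB per PB
    archive_gb = 30 * PB           # projected 2017 size
    growth_gb = 3 * PB             # annual growth

    # Amazon Glacier at 1c/GB/month, storage charges only
    glacier_per_year = archive_gb * 0.01 * 12
    print(f"Glacier: ${glacier_per_year / 1e6:.1f}M/year")          # ~$3.6M/year

    # Backblaze Storage Pod 4.5 at ~5c/GB of raw capacity, per copy
    pod_capex = archive_gb * 0.05
    pod_growth_per_year = growth_gb * 0.05
    print(f"Pods: ${pod_capex / 1e6:.1f}M up front + ${pod_growth_per_year / 1e3:.0f}K/year")
    # ~$1.5M up front plus ~$150K/year, ignoring power, bandwidth and RAID overhead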

It will be very hard to crowd-source resources on this scale, which is why I say this is the opposite of Lots Of Copies Keep Stuff Safe. The system is going to be short of storage; the goal of a backup for the Internet Archive must be the maximum of reliability for the minimum of storage.
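As a rough illustration of why storage efficiency matters so much at this scale (my numbers, not part of Jason's proposal): keeping full extra copies multiplies the donated storage needed, whereas an erasure-coded scheme can tolerate losses for a much smaller overhead.

    # Illustrative only: storage overhead of full replication vs. a k-of-(k+m)
    # erasure code for a 30PB archive. All parameters here are hypothetical.
    ARCHIVE_PB = 30

    def replication_overhead_pb(copies: int) -> float:
        """Extra PB needed beyond the Archive's own copy."""
        return ARCHIVE_PB * (copies - 1)

    def erasure_overhead_pb(k: int, m: int) -> float:
        """Extra PB for m parity shards per k data shards."""
        return ARCHIVE_PB * m / k

    print(replication_overhead_pb(3))   # 60 PB extra for two additional full copies
    print(erasure_overhead_pb(10, 4))   # 12 PB extra, yet survives loss of any 4 of 14 shards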

Nevertheless, I believe it would be well worth trying some version of his proposal and I'm happy to help any way I can. Below the fold, my comments on the design of such a system.

Tuesday, March 17, 2015

More Is Not Better

Hugh Pickens at /. points me to Attention decay in science, yet more evidence that the journal publishers' abdication of their role as gatekeepers is causing problems for science. The abstract claims:
The exponential growth in the number of scientific papers makes it increasingly difficult for researchers to keep track of all the publications relevant to their work. Consequently, the attention that can be devoted to individual papers, measured by their citation counts, is bound to decay rapidly. ... The decay is ... becoming faster over the years, signaling that nowadays papers are forgotten more quickly. However, when time is counted in terms of the number of published papers, the rate of decay of citations is fairly independent of the period considered. This indicates that the attention of scholars depends on the number of published items, and not on real time.
Below the fold, some thoughts.