Thursday, May 21, 2015

Unrecoverable read errors

Trevor Pott has a post at The Register entitled Flash banishes the spectre of the unrecoverable data error in which he points out that while disk manufacturers quoted Bit Error Rates (BER) for hard disks are typically 10-14 or 10-15, SSD BERs range from 10-16 for consumer drives to 10-18 for hardened enterprise drives. Below the fold, a look at his analysis of the impact of this difference of up to 4 orders of magnitude.

Tuesday, May 19, 2015

How Google Crawls Javascript

I started blogging about the transition the Web is undergoing from a document to a programming model, from static to dynamic content, some time ago. This transition has very fundamental implications for Web archiving; what exactly does it mean to preserve something that is different every time you look at it? Not to mention the vastly increased cost of ingest, because executing a program takes a lot more, a potentially unlimited amount of, computation than simply parsing a document.

The transition has big implications for search engines too; they also have to execute rather than parse. Web developers have a strong incentive to make their pages search engine friendly, so although they have enthusiastically embraced Javascript they have often retained a parse-able path for search engine crawlers to follow. We have watched academic journals adopt Javascript, but so far very few have forced us to execute it to find their content.

Adam Audette and his collaborators at Merkle | RKG have an interesting post entitled We Tested How Googlebot Crawls Javascript And Here’s What We Learned. It is aimed at the SEO (Search Engine Optimzation) world but it contains a lot of useful information for Web archiving. The TL;DR is that Google (but not yet other search engines) is now executing the Javascript in ways that make providing an alternate, parse-able path largely irrelevant to a site's ranking. Over time, this will mean that the alternate paths will disappear, and force Web archives to execute the content.

Friday, May 15, 2015

A good op-ed on digital preservation

Bina Venkataraman was White House adviser on climate change innovation and is now at the Broad Foundation Institute working on long-term vs. short-term issues. She has a good op-ed piece in Sunday's Boston Globe entitled The race to preserve disappearing data. She and I e-mailed to and fro as she worked on the op-ed, and I'm quoted in it.

Update: Bina's affiliation corrected - my bad.

Tuesday, May 12, 2015

Potemkin Open Access Policies

Last September Cameron Neylon had an important post entitled Policy Design and Implementation Monitoring for Open Access that started:
We know that those Open Access policies that work are the ones that have teeth. Both institutional and funder policies work better when tied to reporting requirements. The success of the University of Liege in filling its repository is in large part due to the fact that works not in the repository do not count for annual reviews. Both the NIH and Wellcome policies have seen substantial jumps in the proportion of articles reaching the repository when grantees final payments or ability to apply for new grants was withheld until issues were corrected.
He points out that:
Monitoring Open Access policy implementation requires three main steps. The steps are:
  1. Identify the set of outputs are to be audited for compliance
  2. Identify accessible copies of the outputs at publisher and/or repository sites
  3. Check whether the accessible copies are compliant with the policy
Each of these steps are difficult or impossible in our current data environment. Each of them could be radically improved with some small steps in policy design and metadata provision, alongside the wider release of data on funded outputs.
He makes three important recommendations:
  • Identification of Relevant Outputs: Policy design should include mechanisms for identifying and publicly listing outputs that are subject to the policy. The use of community standard persistable and unique identifiers should be strongly recommended. Further work is needed on creating community mechanisms that identify author affiliations and funding sources across the scholarly literature.
  • Discovery of Accessible Versions: Policy design should express compliance requirements for repositories and journals in terms of metadata standards that enable aggregation and consistent harvesting. The infrastructure to enable this harvesting should be seen as a core part of the public investment in scholarly communications.
  • Auditing Policy Implementation: Policy requirements should be expressed in terms of metadata requirements that allow for automated implementation monitoring. RIOXX and ALI proposals represent a step towards enabling automated auditing but further work, testing and refinement will be required to make this work at scale.
What he is saying is that defining policies that mandate certain aspects of Web-published materials without mandating that they conform to standards that make them enforceable over the Web is futile. This should be a no-brainer. The idea that, at scale, without funding, conformance will be enforced manually is laughable. The idea that researchers will voluntarily comply when they know that there is no effective enforcement is equally laughable.

Thursday, May 7, 2015

Amazon's Margins

Barry Ritholtz points me to Ben Thompson's post The AWS IPO, in which he examines Amazon's most recent financials. They're the first in which Amazon has broken out AWS as a separate line of business, so they are the first to reveal the margins Amazon is achieving on their cloud business. The answer is:
AWS is very profitable: $265 million in profit on $1.57 billion in sales last quarter alone, for an impressive (for Amazon!) 17% net margin.
The post starts by supposing that Amazon spun out AWS via an IPO:
One of the technology industry’s biggest and most important IPOs occurred late last month, with a valuation of $25.6 billion dollars. That’s more than Google, which IPO’d at a valuation of $24.6 billion, and certainly a lot more than Amazon, which finished its first day on the public markets with a valuation of $438 million.
It concludes:
The profitability of AWS is a big deal in-and-of itself, particularly given the sentiment that cloud computing will ultimately be a commodity won by the companies with the deepest pockets. It turns out that all the reasons to believe in AWS were spot on: Amazon is clearly reaping the benefits of scale from being the largest player, and their determination to have both the most complete and cheapest offering echoes their prior strategies in e-commerce.
Thompson's post is a must-read; I've only given a small taste of it. But it clearly demonstrates that even AWS overall is very profitable, let alone the profitability of S3, its storage service, which I've been blogging about for more than three years.

Tuesday, May 5, 2015

Max Planck Digital Library on Open Access

Ralf Schimmer of the Max Planck Society's Digital Library  gave a fascinating presentation (PPT) as part of a panel entitled What Price Open Access at the recent CNI meeting. He, and co-authors Kai Karin Geschuhn and Andreas Vogler have now posted the paper on which it was based, Disrupting the subscription journals' business model for the necessary large-scale transformation to open access. Their argument is:
All the indications are that the money already invested in the research publishing system is sufficient to enable a transformation that will be sustainable for the future. There needs to be a shared understanding that the money currently locked in the journal subscription system must be withdrawn and re-purposed for open access publishing services. The current library acquisition budgets are the ultimate reservoir for enabling the transformation without financial or other risks.
They present:
generic calculations we have made on the basis of available publication data and revenue values at global, national and institutional levels.
These include detailed data as to their own spending on open access article processing charges (APCs), which they have made available on-line, and from many other sources including the Wellcome Trust and the Austrian Science Fund. They show that APCs are less than €2.0K/article while subscription costs are €3.8-5.0K/article, so the claim that sufficient funds are available is credible. It is important to note that they exclude hybrid APCs such as those resulting from the stupid double-dipping deals the UK made; these are "widely considered not to reflect a true market value". As an Englishman, I appreciate under-statement. Thus they support my and Andrew Odlyzko's contention that margins in the academic publishing business are extortionate.

Below the fold, I look at some of the details in the paper.

Friday, May 1, 2015

Talk at IIPC General Assembly

The International Internet Preservation Consortium's General Assembly brings together those involved in Web archiving from around the world. This year's was held at Stanford and the Internet Archive. I was asked to give a short talk outlining the LOCKSS Program, explaining how and why it differs from most Web archiving efforts, and how we plan to evolve it in the near future to align it more closely with the mainstream of Web archiving. Below the fold, an edited text with links to the sources.