Friday, February 26, 2016
I've just finished attending the 2016 Usenix FAST conference. Eric Brewer of Google gave a fascinating keynote, which deserves a complete post to itself. Below the fold are notes on the papers that caught my eye, arranged by topic.
Monday, February 22, 2016
1000 long-tail publishers!
The e-journal content that is at risk of loss or cancellation comes from the "long tail" of small publishers. Somehow, the definition of a "small publisher" has come to be one that publishes ten or fewer journals. This seems pretty big to me, but if we adopt this definition the LOCKSS Program has just passed an important milestone. We have sent out a press release announcing that the various networks using LOCKSS technology now preserve content from over 1,000 long-tail publishers. There is still a long way to go, but as the press release says:
there are tens of thousands of long tail publishers worldwide, which makes preserving the first 1,000 publishers an important first step to a larger endeavor to protect vulnerable digital content.
Saturday, February 20, 2016
Andrew Orlowski speaks!
At the Battle of Ideas Festival at the Barbican last year, Claire Fox chaired a panel titled "Is Technology Limiting Our Humanity?", and invited my friend Andrew Orlowski of The Register to speak. Two short but thought-provoking extracts are now up, which The Register's editors have entitled:
- Terrified robots will take middle class jobs? Look in a mirror, in which Andrew argues that jobs are automated only when they have been drained of all human functions such as judgement.
- Meet the original Big Data, TED Talk, Thought Shower Futurist, in which Andrew discusses the analogies between the work of William Playfair (1759-1823) and today's Big Data enthusiasts.
an embezzler and a blackmailer, with some unscrupulous data-gathering methods. He would kidnap farmers until they told him how many sheep they had. Today he’s remembered as the father of data visualisation. He was the first to use the pie chart, the line chart, the bar chart.
...
Playfair stressed the confusion of the moment, its historical discontinuity, and advanced himself as a guru with new methods who was able to make sense of it.
Both extracts are worth your time.
Thursday, February 18, 2016
Gadarene swine
I've been ranting about the way we, possessed by the demons of the Internet of Things, are rushing like the Gadarene Swine to our doom. Below the fold, the latest rant in the series, which wanders off into related, but equally doom-laden areas.
Thursday, February 11, 2016
James Jacobs on Looking Forward
The LOCKSS Program has long been involved with government documents. Recent history, such as that of the Harper administration in Canada, is full of examples of Winston Smith-style history editing by governments. This makes it essential that copies of government documents be maintained outside direct government custody, and several private LOCKSS networks are doing this for various kinds of government documents. Below the fold, a look at the US Federal Depository Library Program, which has been doing this in the paper world for a long time, and at the state of its gradual transition to the digital world.
Tuesday, February 9, 2016
The Malware Museum
Mikko Hypponen and Jason Scott at the Internet Archive have put up the Malware Museum:
a collection of malware programs, usually viruses, that were distributed in the 1980s and 1990s on home computers. Once they infected a system, they would sometimes show animation or messages that you had been infected. Through the use of emulations, and additionally removing any destructive routines within the viruses, this collection allows you to experience virus infection of decades ago with safety.
The museum is an excellent use of emulation and well worth a visit.
I discussed the issues around malware in my report on emulation. The malware in the Malware Museum is too old to be networked, and thus avoids the really difficult issues caused by running old, and thus highly vulnerable, software with access to the network.
Even if emulation can ensure that only the virtual machine and not its host is infected, and users can be warned not to input any personal information to it, this may not be enough. The goal of the infection is likely to be to co-opt the virtual machine into a botnet, or to act as a Trojan on your network. If you run this vulnerable software, you are doing something that a reasonable person would understand puts other people's real machines at risk. The liability issues of doing so bear thinking about.
Tuesday, February 2, 2016
Always read the fine print
When Amazon announced Glacier I took the trouble to read their pricing information carefully and wrote:
Because the cost penalties for peak access to storage and for small requests are so large ..., if Glacier is not to be significantly more expensive than local storage in the long term, preservation systems that use it will need to be carefully designed to rate-limit accesses and to request data in large chunks.
Now, 40 months later, Simon Sharwood at The Register reports that people who didn't pay attention are shocked that using Glacier can cost more in a month than enough disk to store the data 60 times over:
Last week, a chap named Mario Karpinnen took to Medium with a tale of how downloading 60GB of data from Amazon Web Services' archive-grade Glacier service cost him a whopping US$158.
Karpinnen's post is a cautionary tale for Glacier believers, but the real problem is he didn't look the gift horse in the mouth:
Karpinnen went into the fine print of Glacier pricing and found that the service takes your peak download rate, multiplies the number of gigabytes downloaded in your busiest hour for the month and applies it to every hour of the whole month. His peak data retrieval rate of 15.2GB an hour was therefore multiplied by the $0.011 per gigabyte charged for downloads from Glacier. And then multiplied by the 744 hours in January. Once tax and bandwidth charges were added, in came the bill for $158.
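To see how those numbers turn into a $158 bill, here is a rough back-of-the-envelope sketch in Python of the arithmetic Sharwood describes. The function is purely illustrative, not AWS's actual billing logic, and it ignores the free retrieval allowance, bandwidth charges and tax. It also shows what the rate-limiting I recommended above buys you: spreading the same 60GB over the whole month cuts the retrieval fee from roughly $124 to well under a dollar.

```python
# A back-of-the-envelope sketch (not AWS's actual billing code) of the
# legacy Glacier retrieval fee as Sharwood describes it: the volume
# retrieved in the busiest hour of the month is billed as though it were
# sustained for every hour of the month. Free retrieval allowance,
# bandwidth charges and tax are ignored here.

def legacy_glacier_retrieval_fee(peak_gb_per_hour, price_per_gb, hours_in_month):
    """Fee = peak hourly retrieval (GB) x per-GB price x hours in the month."""
    return peak_gb_per_hour * price_per_gb * hours_in_month

PRICE_PER_GB = 0.011  # $/GB retrieval price quoted above
HOURS_IN_JAN = 744    # 31 days x 24 hours

# Karpinnen's case: 60GB pulled down fast enough to peak at 15.2GB/hour.
fast = legacy_glacier_retrieval_fee(15.2, PRICE_PER_GB, HOURS_IN_JAN)
print(f"Fast retrieval:         ${fast:.2f}")  # ~$124, before bandwidth and tax

# The same 60GB rate-limited to a trickle over the whole month peaks at
# only ~0.08GB/hour, which is why rate-limiting accesses matters so much
# under this pricing model.
slow = legacy_glacier_retrieval_fee(60 / HOURS_IN_JAN, PRICE_PER_GB, HOURS_IN_JAN)
print(f"Rate-limited retrieval: ${slow:.2f}")  # ~$0.66
```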
But doing the math (and factoring in VAT and the higher prices at AWS’s Irish region), I had the choice of either paying almost $10 a month for the simplicity of S3 or just 87¢/mo for what was essentially the same thing,
He should have asked himself how Amazon could afford to sell "essentially the same thing" for one-tenth the price. Why wouldn't all their customers switch? I asked myself this in my post on the Glacier announcement:
In order to have a competitive product in the long-term storage market Amazon had to develop a new one, with a different pricing model. S3 wasn't competitive.
As Sharwood says:
Karpinnen's post and Oracle's carping about what it says about AWS both suggest a simple moral to this story: cloud looks simple, but isn't, and buyer beware applies every bit as much as it does for any other product or service.
The fine print was written by the vendor's lawyers. They are not your friends.