Thursday, July 20, 2017

Patting Myself On The Back

Cost vs. Kryder rate
I started working on economic models of long-term storage six years ago, and quickly discovered the effect shown in this graph. It plots the endowment, the money which, deposited with the data and invested at interest, pays for the data to be stored "forever", as a function of the Kryder rate, the rate at which $/GB drops with time. As the rate slows below about 20%, the endowment needed rises rapidly. Back in early 2011 it was widely believed that 30-40% Kryder rates were a law of nature; they had been that way for 30 years. Thus, if you could afford to store data for the next few years, you could afford to store it forever.
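To make the shape of that curve concrete, here is a minimal sketch of the endowment computation. It assumes a constant real interest rate, a smooth annual decline in storage cost at the Kryder rate, and a 100-year horizon standing in for "forever"; the parameter values are illustrative, not those of the published model.

def endowment(cost_per_tb_year, kryder_rate, interest_rate, years=100):
    """Present value of the cost of storing 1TB for `years` years."""
    total = 0.0
    for t in range(years):
        yearly_cost = cost_per_tb_year * (1 - kryder_rate) ** t  # yearly cost falls at the Kryder rate
        discount = (1 + interest_rate) ** t                      # the endowment earns interest meanwhile
        total += yearly_cost / discount
    return total

for k in (0.40, 0.30, 0.20, 0.10, 0.05, 0.0):
    print(f"Kryder rate {k:4.0%}: endowment ${endowment(100.0, k, 0.02):,.0f} per TB")

With these illustrative numbers the endowment roughly doubles between a 20% and a 10% Kryder rate, and grows much faster still as the rate approaches zero, which is the knee visible in the graph.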

2014 cost/byte projection
As it turned out, 2011 was a good time to work on this issue. That October, floods in Thailand destroyed 40% of the world's disk manufacturing capacity, and disk prices spiked. Preeti Gupta at UC Santa Cruz reviewed disk pricing in 2014, and we produced this graph. I wrote at the time:
The red lines are projections at the industry roadmap's 20% and a less optimistic 10%. [The graph] shows three things:
  • The slowing started in 2010, before the floods hit Thailand.
  • Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.
  • If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.
Backblaze average $/GB
Thanks to Backblaze's admirable transparency, we have three more years of data. Their blog reports their view of disk pricing as a bulk purchaser over many years, in far more detail than the data Preeti was able to work with. Eyeballing the graph, we see a 2013 price around 5c/GB and a 2017 price around half that. A 10% Kryder rate would have meant a 2017 price of 3.2c/GB, and a 20% rate would have meant 2c/GB, so the out-turn lies between the two red lines on our graph. It is difficult to make predictions, especially about the future. But Preeti and I nailed this one.
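For anyone checking the arithmetic, those projected 2017 prices are just the 2013 price compounded forward four years at each assumed Kryder rate. A quick sketch; the starting point is the 5c/GB eyeballed above, so the results match the quoted figures only to the precision of reading the graph:

price_2013 = 5.0  # cents/GB, eyeballed from Backblaze's graph
for rate in (0.10, 0.20):
    price_2017 = price_2013 * (1 - rate) ** 4  # four years of compounding
    print(f"{rate:.0%} Kryder rate -> {price_2017:.1f}c/GB in 2017")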

This is a big deal. As I've said many times:
Storage will be
Much less free
Than it used to be
The real cost of a commitment to store data for the long term is much greater than most people believe, and there is no realistic prospect of a technological discontinuity that would change this.

Tuesday, July 11, 2017

Is Decentralized Storage Sustainable?

There are many reasons to dislike centralized storage services. They include business risk, as we see in le petit musée des projets Google abandonnés (the little museum of abandoned Google projects), monoculture vulnerability and rent extraction. There is thus naturally a lot of enthusiasm for decentralized storage systems, such as MaidSafe, DAT and IPFS. In 2013 I wrote about one of their advantages in Moving vs. Copying. Among the enthusiasts is Lambert Heller. Since I posted Blockchain as the Infrastructure for Science, Heller and I have been talking past each other. Heller is talking about technology; I have some problems with the technology, but they aren't that important. My main problem is an economic one that applies to decentralized storage irrespective of the details of the technology.

Below the fold is an attempt to clarify my argument. It is a re-statement of part of the argument in my 2014 post Economies of Scale in Peer-to-Peer Networks, specifically in the context of decentralized storage networks.

Thursday, July 6, 2017

Archive vs. Ransomware

Archives perennially ask the question "how few copies can we get away with?"
This is a question I've blogged about in 2016, 2011 and 2010, when I concluded:
  • The number of copies needed cannot be discussed except in the context of a specific threat model.
  • The important threats are not amenable to quantitative modeling.
  • Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".
I've also written before about the immensely profitable business of ransomware. Recent events, such as WannaCrypt, NotPetya and the details of the NSA's ability to infect air-gapped computers, should convince anyone that ransomware is a threat to which archives are exposed. Below the fold I look into how archives should be designed to resist this credible threat.

Thursday, June 29, 2017

"to promote the progress of useful Arts"

This is just a quick note to say that anyone who believes the current patent and copyright systems are working "to promote the progress of useful Arts" needs to watch Bunnie Huang's talk to the Stanford EE380 course, and read Bunnie's book The Hardware Hacker. Below the fold, a brief explanation.

Tuesday, June 27, 2017

Wall Street Journal vs. Google

After we worked together at Sun Microsystems, Chuck McManis worked at Google and then built another search engine (Blekko). His contribution to the discussion on Dave Farber's IP list about the argument between the Wall Street Journal and Google is very informative. Chuck gave me permission to quote liberally from it in the discussion below the fold.

Thursday, June 22, 2017

WAC2017: Security Issues for Web Archives

Jack Cushman and Ilya Kreymer's Web Archiving Conference talk Thinking like a hacker: Security Considerations for High-Fidelity Web Archives is very important. They discuss 7 different security threats specific to Web archives:
  1. Archiving local server files
  2. Hacking the headless browser
  3. Stealing user secrets during capture
  4. Cross site scripting to steal archive logins
  5. Live web leakage on playback
  6. Show different page contents when archived
  7. Banner spoofing
Below the fold, a brief summary of each to encourage you to do two things:
  1. First, view the slides.
  2. Second, visit http://warc.games, a sandbox with a local version of Webrecorder that has not been patched to fix known exploits, and a number of challenges to help you learn how these exploits might apply to web archives in general.

Tuesday, June 20, 2017

Analysis of Sci-Hub Downloads

Bastian Greshake has a post at the LSE's Impact of Social Sciences blog based on his F1000Research paper Looking into Pandora's Box. In both, he reports on an analysis combining two datasets released by Alexandra Elbakyan:
  • A 2016 dataset of 28M downloads from Sci-Hub between September 2015 and February 2016.
  • A 2017 dataset of 62M DOIs to whose content Sci-Hub claims to be able to provide access.
Below the fold, some extracts and commentary.
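As a purely hypothetical illustration of the kind of join such an analysis involves, combining the two releases comes down to matching download records against the catalog by DOI. The file names and column names below are invented for the sketch; they are not Greshake's code or the actual layout of Elbakyan's data.

import pandas as pd

# Hypothetical sketch: file and column names are invented, not the actual
# layout of Elbakyan's releases or Greshake's analysis.
downloads = pd.read_csv("scihub_downloads_2015-2016.csv")  # one row per download, with a 'doi' column
catalog = pd.read_csv("scihub_doi_catalog_2017.csv")       # one row per DOI Sci-Hub claims to hold

# Count downloads per DOI, then left-join onto the catalog to see which
# claimed holdings were actually requested in the six-month window.
per_doi = downloads.groupby("doi").size().reset_index(name="downloads")
merged = catalog.merge(per_doi, on="doi", how="left").fillna({"downloads": 0})

print("DOIs claimed:", len(catalog))
print("DOIs downloaded at least once:", int((merged["downloads"] > 0).sum()))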