Tuesday, October 27, 2015

More interesting numbers from Backblaze

On the 17th of last month Amazon, in some regions, cut the Glacier price from 1c/GB/month to 0.7c/GB/month. It had been stable since it was announced in August 2012. As usual with Amazon, they launched at an aggressive and attractive price, and stuck there for a long time. Glacier wasn't under a lot of competitive pressure, so they didn't need to cut the price. Below the fold, I look at how Backblaze changed this.

Wednesday, October 21, 2015

ISO review of OAIS

ISO standards are regularly reviewed. In 2017, the OAIS standard ISO14721 will be reviewed. The DPC is spearheading a praiseworthy effort to involve the digital preservation community in the process of providing input to this review, via this Wiki.

I've been critical of OAIS over the years, not so much of the standard itself, but of the way it was frequently mis-used. Its title is Reference Model for an Open Archival Information System (OAIS), but it is often treated as if it were entitled The Definition of Digital Preservation, and used as a way to denigrate digital preservation systems that work in ways the speaker doesn't like by claiming that the offending system "doesn't conform to OAIS". OAIS is a reference model and, as such, defines concepts and terminology. It is the concepts and terminology used to describe a system that can be said to conform to OAIS.

Actual systems are audited for conformance to a set of OAIS-based criteria, defined currently by ISO16363. The CLOCKSS Archive passed such an audit last year with flying colors. Based on this experience, we identified a set of areas in which the concepts and terminology of OAIS were inadequate to describe current digital preservation systems such as the CLOCKSS Archive.

I was therefore asked to inaugurate the DPC's OAIS review Wiki with a post that I entitled The case for a revision of OAIS. My goal was to encourage others to post their thoughts. Please read my post and do so.

Tuesday, October 20, 2015

Storage Technology Roadmaps

At the recent Library of Congress Storage Architecture workshop, Robert Fontana of IBM gave an excellent overview of the roadmaps for tape, disk, optical and NAND flash (PDF) storage technologies in terms of bit density and thus media capacity. His slides are well worth studying, but here are his highlights for each technology:
  • Tape has a very credible roadmap out to LTO10 with 48TB/cartridge somewhere around 2022.
  • Optical's roadmap shows increases from the current 100GB/disk to 200, 300, 500 and 1000GB/disk, but there are no dates on them. At least two of those increases will encounter severe difficulties making the physics work.
  • The hard disk roadmap shows the slow increase in density that has prevailed for the last 4 years continuing until 2017, when it accelerates to 30%/yr. The idea is that in 2017 Heat Assisted Magnetic Recording (HAMR) will be combined with shingling, and then in 2021 Bit Patterned Media (BPM) will take over, and shortly after be combined with HAMR.
  • The roadmap for NAND flash is for density to increase in the near term by 2-3X and over the next 6-8 years by 6-8X. This will require significant improvements in processing technology but "processing is a core expertise of the semiconductor industry so success will follow".
Below the fold, my comments.

Friday, October 16, 2015

Securing WiFi routers

Via Dave Farber's IP list, I find that he, Dave Taht, Jim Gettys, the bufferbloat team, and other luminaries have submitted a response to the FCC's proposed rule-making (PDF) that would have outlawed software defined radios and open source WiFi router software such as OpenWrt. My blogging about the Internet of Things started a year ago from a conversation with Jim when he explained the Moon malware, which was scanning home routers. It subsequently turned out to be preparing to take out Sony and Microsoft's gaming networks at Christmas. Its hard to think of a better demonstration of the need for reform of the rules for home router software, but the FCC's proposal to make the only reasonably secure software for them illegal is beyond ridiculous.

The recommendations they submitted are radical but sensible and well-justified by events:
  1. Any vendor of software-defined radio (SDR), wireless, or Wi-Fi radio must make public the full and maintained source code for the device driver and radio firmware in order to maintain FCC compliance. The source code should be in a buildable, change-controlled source code repository on the Internet, available for review and improvement by all.
  2. The vendor must assure that secure update of firmware be working at time of shipment, and that update streams be under ultimate control of the owner of the equipment. Problems with compliance can then be fixed going forward by the person legally responsible for the router being in compliance.
  3. The vendor must supply a continuous stream of source and binary updates that must respond to regulatory transgressions and Common Vulnerability and Exposure reports (CVEs) within 45 days of disclosure, for the warranted lifetime of the product, or until five years after the last customer shipment, whichever is longer.
  4. Failure to comply with these regulations should result in FCC decertification of the existing product and, in severe cases, bar new products from that vendor from being considered for certification.
  5. Additionally, we ask the FCC to review and rescind any rules for anything that conflicts with open source best practices, produce unmaintainable hardware, or cause vendors to believe they must only ship undocumented “binary blobs” of compiled code or use lockdown mechanisms that forbid user patching. This is an ongoing problem for the Internet community committed to best practice change control and error correction on safety-critical systems.
As the submission points out, experience to date shows that vendors of home router equipment are not motivated to, do not have the skills to, and do not, maintain the security of their software. Locking down the vendor's insecure software so it can't be diagnosed or updated is a recipe for even more such disasters. The vendors don't care if their products are used in botnets or steal their customer's credentials. Forcing the vendors to use open source software and to respond in a timely fashion to vulnerability discoveries on pain of decertification is the only way to fix the problems.

Thursday, October 15, 2015

A Pulitzer is no guarantee

Bina Venkataraman points me to Adrienne LaFrance's piece Raiders of the Lost Web at The Atlantic. It is based on an account of last month's resurrection of a 34-part, Pulitzer-winning newspaper investigation from 2007 of the aftermath of a 1961 railroad crossing accident in Colorado. It vanished from the Web when The Rocky Mountain News folded and survived only because Kevin Vaughan, the reporter, kept a copy on DVD-ROM.

Doing so likely violated copyright. Even though The Crossing was not an "orphan work":
in 2009, the year the paper went under, Vaughan began asking for permission—from the [Denver Public] library and from E.W. Scripps, the company that owned the Rocky—to resurrect the series. After four years of back and forth, in 2013, the institutions agreed to let Vaughan bring it back to the web.
Four years, plus another two to do the work. Imagine how long it would have taken had the story actually been orphaned. Vaughan also just missed another copyright problem:
With [ex-publisher John] Temple’s help, Vaughan got permission from the designer Roger Black to use Rocky, the defunct newspaper’s proprietary typeface.
This is the orphan font problem that I've been warning about for the last 6 years. There is a problem with the resurrected site:
It also relied heavily on Flash, once-ubiquitous software that is now all but dead. “My role was fixing all of the parts of the website that had broken due to changes in web standards and a change of host,” said [Kevin's son] Sawyer, now a junior studying electrical engineering and computer science. “The coolest part of the website was the extra content associated with the stories... The problem with the website is that all of this content was accessible to the user via Flash.”
It still is. Soon, accessing the "coolest part" of the resurrected site will require a virtual machine with a legacy browser.

There is a problem with the article. It correctly credits the Internet Archive with its major contribution to Web archiving, and analogizes it to the Library of Alexandria. But it fails to mention any of the other Web archives and, unlike Jill Lepore's New Yorker "Cobweb" article, doesn't draw the lesson from the analogy. Because the Library of Alexandria was by far the largest repository of knowledge in its time, its destruction was a catastrophe. The Internet Archive is by far the largest Web archive, but it is uncomfortably close to several major faults. And backing it up seems to be infeasible.

Wednesday, October 14, 2015

Orphan Works "Reform"

Lila Bailey at Techdirt has a post entitled Digital Orphans: The Massive Cultural Black Hole On Our Horizon about the Copyright Office's proposal for fixing the "orphan works" problem. As she points out, the proposal doesn't fix it, it just makes it different:
it doesn't once mention or consider the question of what we are going to do about the billions of orphan works that are being "born digital" every day.

Instead, the Copyright Office proposes to "solve" the orphan works problem with legislation that would impose substantial burdens on users that would only work for one or two works at any given time. And because that system is so onerous, the Report also proposes a separate licensing regime to support so-called "mass digitization," while simultaneously admitting that this regime would not really be appropriate for orphans (because there's no one left to claim the licensing fees). These proposals have been resoundingly criticized for many valid reasons.
She represented the Internet Archive in responding to the report, so she knows whereof she writes about the born-digital user-generated content that documents today's culture:
We are looking down the barrel of a serious crisis in terms of society's ability to access much of the culture that is being produced and shared online today. As many of these born-digital works become separated from their owners, perhaps because users move on to newer and cooler platforms, or because the users never wanted their real identity associated with this stuff in the first place, we will soon have billions upon billions of digital orphans on our hands. If those orphans survive the various indignities that await them ... we are going to need a way to think about digital orphans. They clearly will not need to be digitized so the Copyright Office's mass digitization proposal would not apply.
The born-digital "orphan works" problem is intertwined with the problems posed by the fact that much of this content is dynamic, and its execution depends on other software not generated by the user, which is both copyright and covered by an end-user license agreement, and is not being collected by the national libraries under copyright deposit legislation.

Tuesday, October 13, 2015

Access vs. Control

Hugh Rundle, in an interesting piece entitled When access isn’t even access reporting from the Australian Internet Governance Forum writes about librarians:
Worst of all, our obsession with providing access ultimately results in the loss of access. Librarians created the serials crisis because they focussed on access instead of control. The Open Access movement has had limited success because it focusses on access to articles instead of remaking the economics of academic careers. Last week Proquest announced it had gobbled up Ex-Libris, further centralising corporate control over the world’s knowledge. Proquest will undoubtedly now charge even more for their infinitely-replicable-at-negligible-cost digital files. Libraries will pay, because ‘access’. At least until they can’t afford it. The result of ceding control over journal archives has not been more access, but less.
As Benjamin Franklin might have said if he was a librarian: those who would give up essential liberty, to purchase a little access, deserve neither and will lose both.
From the very beginning of the LOCKSS Program 17 years ago, the goal has been to provide librarians with the tools they need to take control and ownership of the content they pay for. As I write, JSTOR has been unable to deliver articles for three days and libraries all over the world have been deprived of access to all the content they have paid JSTOR for through the years. Had they owned copies of the content, as they did on paper, no such system-wide failure would have been been possible.

Friday, October 9, 2015

The Cavalry Shows Up in the IoT War Zone

Back in May I posted Time For Another IoT Rant. Since then I've added 28 comments about the developments over the last 132 days, or more than one new disaster every 5 days. Those are just the ones I noticed. So its time for another dispatch from the front lines of the IoT war zone on which I can hang reports of the disasters to come.  Below the fold, I cover yesterday's happenings on two sectors of the front line.

Thursday, October 8, 2015

Two In Two Days

Tuesday, Cory Doctorow pointed to "another of [Maciej Cegłowski's] barn-burning speeches". It is entitled What Happens Next Will Amaze You and it is a must-read exploration of the ecosystem of the Web and its business model of pervasive surveillance. I commented on my post from last May Preserving the Ads? pointing to it, because Cegłowski goes into much more of the awfulness of the Web ecosystem than I did.

Yesterday Doctorow pointed to another of Maciej Cegłowski's barn-burning speeches. This one is entitled Haunted by Data, and it is just as much of a must-read. Doctorow is obviously a fan of Cegłowski's and now so am I. It is hard to write talks this good, and even harder to ensure that they are relevant to stuff I was posting in May. This one takes the argument of The Panopticon Is Good For You, also from May, and makes it more general and much clearer. Below the fold, details.

Tuesday, October 6, 2015

Another good prediction

After patting myself on the back about one good prediction, here is another. Ever since Dave Anderson's presentation to the 2009 Storage Architecture meeting at the Library of Congress, I've been arguing that for flash to displace disk as the bulk storage medium would require flash vendors to make such enormous investments in new fab capacity that there would be no possibility of making an adequate return on the investments. Since the vendors couldn't make money on the investment, they wouldn't make it, and flash would not displace disk. 6 years later, despite the arrival of 3D flash that is still the case.

Source: Gartner & Stifel
Chris Mellor at The Register has the story in a piece entitled Don't want to fork out for NAND flash? You're not alone. Disk still rules. Its summed up in this graph, showing the bytes shipped by flash and disk vendors.It shows that the total bytes shipped is growing rapidly, but the proportion that is flash is about stable. Flash is:
expected to account for less than 10 per cent of the total storage capacity the industry will need by 2020.
Stifel estimates that:
Samsung is estimated to be spending over $23bn in capex on its 3D NAND for for an estimated ~10-12 exabytes of capacity.
If it is fully ramped-in by 2018 it will make about 1% of what the disk manufacturers will that year. So the investment to replace that capacity would be $2.3T, which clearly isn't going to happen. Unless the investment to make a petabyte of flash per year is much less than the investment to make a petabyte of disk, disk will remain the medium of choice for bulk storage.

Sunday, October 4, 2015

Pushing back against network effects

I've had occasion to note the work of Steve Randy Waldman before. Today, he has a fascinating post up entitled 1099 as Antitrust that may not at first seem relevant to digital preservation. Below the fold I trace the important connection.