DSHR's Blog: 2017

Thursday, December 28, 2017

Why Decentralize?

In Blockchain: Hype or Hope? (paywalled until June '18) Radia Perlman asks what exactly you get in return for the decentralization provided by the enormous resource cost of blockchain technologies? Her answer is:

a ledger agreed upon by consensus of thousands of anonymous entities, none of which can be held responsible or be shut down by some malevolent government ... [but] most applications would not require or even want this property.

Two important essays published last February by pioneers in the field provide different answers to Perlman's question:

Vitalik Buterin's answer in The Meaning of Decentralization is that what you get depends on what exactly you mean by "decentralization".
Nick Szabo's answer in Money, blockchains, and social scalability is "social scalability".

Below the fold I try to apply our experience with the decentralized LOCKSS technology to ask whether their arguments hold up. I'm working on a follow-up post based on Chelsea Barabas, Neha Narula and Ethan Zuckerman's Defending Internet Freedom through Decentralization from last August, which asks the question specifically about the decentralized Web and thus the idea of decentralized storage.

Updating Flash vs. Hard Disk

Chris Mellor at The Register has a useful update on the evolution of the storage market based on analysis from Aaron Rakers. Below the fold, I have some comments on it. In order to understand them you will need to have read my post The Medium-Term Prospects for Long-Term Storage Systems from a year ago.

Science Friday's "File Not Found"

Science Friday's Lauren Young has a three-part series on digital preservation:

Ghosts In The Reels is about magnetic tape.
The Librarians Saving The Internet is about Web archiving.
Data Reawakening is about the search for a quasi-immortal medium.

Clearly, increasing public attention to the problem of preserving digital information is a good thing, but I have reservations about these posts. Below the fold, I lay them out.

Bad Identifiers

This post on persistent identifiers (PIDs) has been sitting in my queue in note form for far too long. Its re-animation was sparked by an excellent post at PLOS Biologue by Julie McMurry, Lilly Winfree and Melissa Haendel entitled Bad Identifiers are the Potholes of the Information Superhighway: Take-Home Lessons for Researchers, which draws attention to a paper, Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data, of which they are three of the many authors. In addition, there were two papers at this year's iPRES on the topic:

Remco van Veenendaal et al's Getting Persistent Identifiers Implemented By ‘Cutting In The Middle-Man’ describes how the Dutch Digital Heritage Network (DHN) worked with vendors to implement PIDs.
Angela Dappert and Adam Farquhar's Permanence of the Scholarly Record: Persistent Identification and Digital Preservation – A Roadmap is a view from the British Library on how PIDs and digital preservation can be better integrated.

Below the fold, some thoughts on PIDs.

Cliff Lynch's Stewardship in the "Age of Algorithms"

Cliff Lynch has just published a long and very important article at First Monday entitled Stewardship in the "Age of Algorithms". It is a much broader look than my series The Amnesiac Civilization at the issues around providing the future with a memory of today's society.

Cliff accurately describes the practical impossibility of archiving systems such as Facebook that today form the major part of most people's information environment, and asks:

If we abandon the ideas of archiving in the traditional preservation of an artifact sense, it’s helpful to recall the stewardship goal here to guide us: to capture the multiplicity of ways in which a given system behaves over the range of actual or potential users. ... Who are these “users” (and how many of them are there)? How do we characterize them, and how do we characterize system behavior?

Then, with a tip of the hat to Don Waters, he notes that this problem is familiar in other fields:

they are deeply rooted in historical methods of anthropology, sociology, political science, ethnography and related humanistic and social science disciplines that seek to document behaviors that are essentially not captured in artifacts, and indeed to create such documentary artifacts

Unable to archive the system they are observing, these fields try to record and annotate the experience of those encountering the system; to record the performance from the audience's point of view. Cliff notes, and discusses the many problems with, the two possible kinds of audience for "algorithms":

Programs, which he calls robotic witnesses, and others call sock puppets. Chief among the problems here is that "algorithms" need robust defenses against programs posing as humans (see, for example, spam, or fake news).
Humans, which he calls New Nielson Families. Chief among the problems here is the detailed knowledge "algorithms" use to personalize their behaviors, leading to a requirement for vast numbers of humans to observe even somewhat representative behavior.

Cliff concludes:

From a stewardship point of view (seeking to preserve a reasonably accurate sense of the present for the future, as I would define it), there’s a largely unaddressed crisis developing as the dominant archival paradigms that have, up to now, dominated stewardship in the digital world become increasingly inadequate. ... the existing models and conceptual frameworks of preserving some kind of “canonical” digital artifacts ... are increasingly inapplicable in a world of pervasive, unique, personalized, non-repeatable performances. As stewards and stewardship organizations, we cannot continue to simply complain about the intractability of the problems or speak idealistically of fundamentally impossible “solutions.”
...
If we are to successfully cope with the new “Age of Algorithms,” our thinking about a good deal of the digital world must shift from artifacts requiring mediation and curation, to experiences. Specifically, it must focus on making pragmatic sense of an incredibly vast number of unique, personalized performances (including interaction with the participant) that can potentially be recorded or otherwise documented, or at least do the best we can with this.

I agree that society is facing a crisis in its ability to remember the past. Cliff has provided a must-read overview of the context in which the crisis has developed, and some pointers to pragmatic if unsatisfactory ways to address it. What I would like to see is an even broader view, describing this crisis as one among many caused by the way increasing returns to scale are squeezing out the redundancy essential to a resilient civilization.

Tuesday, December 5, 2017

International Digital Preservation Day

The Digital Preservation Coalition's International Digital Preservation Day was marked by a wide-ranging collection of blog posts. Below the fold, some links to, and comments on, a few of them.

Intel's "Management Engine"

Back in May Erica Portnoy and Peter Eckersley, writing for the EFF's Deep Links blog, summed up the situation in a paragraph:

Since 2008, most of Intel’s chipsets have contained a tiny homunculus computer called the “Management Engine” (ME). The ME is a largely undocumented master controller for your CPU: it works with system firmware during boot and has direct access to system memory, the screen, keyboard, and network. All of the code inside the ME is secret, signed, and tightly controlled by Intel. ... there is presently no way to disable or limit the Management Engine in general. Intel urgently needs to provide one.

Recent events have pulled back the curtain somewhat and revealed that things are worse than we knew in May. Below the fold, some details.

Has Web Advertising Jumped The Shark?

The Web runs on advertising. Has Web advertising jumped the shark? The relevant Wikipedia article says:

The usage of "jump the shark" has subsequently broadened beyond television, indicating the moment when a brand, design, franchise, or creative effort's evolution declines, or when it changes notably in style into something unwelcome.

There are four big problems with Web advertising as it currently exists:

Bad guys love it.
Readers hate it.
Webmasters hate it.
Advertisers find it wastes money.

#4 just might have something to do with #3, #2 and #1. It seems that there's a case to be made. Below the fold I try to make it.

Techno-hype part 2

Don't, don't, don't, don't believe the hype!

Public Enemy

Enough about the hype around self-driving cars, now on to the hype around cryptocurrencies.

Sysadmins like David Gerard tend to have a realistic view of new technologies; after all, they get called at midnight when the technology goes belly-up. Sensible companies pay a lot of attention to their sysadmins' input when it comes to deploying new technologies.

Gerard's Attack of the 50 Foot Blockchain: Bitcoin, Blockchain, Ethereum & Smart Contracts is a must-read, massively sourced corrective to the hype surrounding cryptocurrencies and blockchain technology. Below the fold, some tidbits and commentary. Quotes not preceded by links are from the book, and I have replaced some links to endnotes with direct links.

Techno-hype part 1

Don't, don't, don't, don't believe the hype!

Public Enemy

New technologies are routinely over-hyped because people under-estimate the gap between a technology that works and a technology that is in everyday use by normal people.

You have probably figured out that I'm skeptical of the hype surrounding blockchain technology. Despite incident-free years spent routinely driving in company with Waymo's self-driving cars, I'm also skeptical of the self-driving car hype. Below the fold, an explanation.

Keynote at Pacific Neighborhood Consortium

I was invited to deliver a keynote at the 2017 Pacific Neighborhood Consortium in Tainan, Taiwan. My talk, entitled The Amnesiac Civilization, was based on the series of posts earlier this year with the same title. The theme was "Data Informed Society", and my abstract was:

What is the data that informs a society? It is easy to think that it is just numbers, timely statistical information of the kind that drives Google Maps real-time traffic display. But the rise of text-mining and machine learning means that we must cast our net much wider. Historic and textual data is equally important. It forms the knowledge base on which civilization operates.

For nearly a thousand years this knowledge base has been stored on paper, an affordable, durable, write-once and somewhat tamper-evident medium. For more than five hundred years it has been practical to print on paper, making Lots Of Copies to Keep Stuff Safe. LOCKSS is the name of the program at the Stanford Libraries that Vicky Reich and I started in 1998. We took a distributed approach, providing libraries with tools they could use to preserve knowledge in the Web world. They could work the way they were used to doing in the paper world, by collecting copies of published works, making them available to readers, and cooperating via inter-library loan. Two years earlier, Brewster Kahle had founded the Internet Archive, taking a centralized approach to the same problem.

Why are these programs needed? What have we learned in the last two decades about their effectiveness? How does the evolution of Web technologies place their future at risk?

Below the fold, the text of my talk.

Randall Munroe Says It All

The latest XKCD is a succinct summation of the situation, especially the mouse-over.

Tuesday, October 31, 2017

Storage Failures In The Field

It's past time for another look at the invaluable hard drive data that Backblaze puts out quarterly. As Peter Bright notes at Ars Technica, despite being based on limited data, the current stats reveal two interesting observations:

Backblaze is seeing reduced rates of infant mortality for the 10TB and 12TB drive generations:

The initial data from the 10TB and 12TB disks, however, has not shown that pattern. While the data so far is very limited, with 1,240 disks and 14,220 aggregate drive days accumulated so far, none of these disks (both Seagate models) have failed.
Backblaze is seeing no reliability advantage from enterprise as against consumer drives:

the company has now accumulated 3.7 million drive days for the consumer disks and 1.4 million for the enterprise ones. Over this usage, the annualized failure rates are 1.1 percent for the consumer disks and 1.2 percent for the enterprise ones.
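To see why "very limited" matters for the first observation, here is a rough sketch of the arithmetic. It assumes the usual definition of annualized failure rate (failures per drive-year) and, as a further simplification of my own, that failures arrive independently at a constant rate. The function names are illustrative, not anything Backblaze publishes.

```python
import math

DAYS_PER_YEAR = 365.0

def annualized_failure_rate(failures: float, drive_days: float) -> float:
    """Failures per drive-year, as a fraction (0.011 means 1.1%)."""
    return failures / (drive_days / DAYS_PER_YEAR)

def expected_failures(afr: float, drive_days: float) -> float:
    """Failures expected over drive_days at a constant annualized rate."""
    return afr * drive_days / DAYS_PER_YEAR

def probability_of_no_failures(afr: float, drive_days: float) -> float:
    """Chance of seeing zero failures, assuming Poisson arrivals (a simplification)."""
    return math.exp(-expected_failures(afr, drive_days))

# Figures quoted above: 14,220 drive days for the new 10TB and 12TB drives,
# and a 1.1% annualized rate for the consumer drives.
new_drive_expected = expected_failures(0.011, 14_220)
new_drive_p_zero = probability_of_no_failures(0.011, 14_220)
```

Under these assumptions, drives failing at the ordinary consumer rate would be expected to produce well under one failure in 14,220 drive days, so zero failures so far is unsurprising either way.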

Below the fold, some commentary.

Preserving Malware

Jonathan Farbowitz's NYU MA thesis More Than Digital Dirt: Preserving Malware in Archives, Museums, and Libraries is well worth a more leisurely reading than I've given it so far. He expands greatly on the argument I've made that preserving malware is important, and attempting to ensure archives are malware-free is harmful:

At ingest time, the archive doesn't know what it is about the content future scholars will be interested in. In particular, they don't know that the scholars aren't studying the history of malware. By modifying the content during ingest they may be destroying its usefulness to future scholars.

For example, Farbowitz introduces his third chapter A Series of Inaccurate Analogies thus:

In my research, I encountered several criticisms of both the intentional collection of malware by cultural heritage institutions and the preservation of malware-infected versions of digital artefacts. These critics have attempted to draw analogies between malware infection and issues that are already well-understood in the treatment and care of archival collections. I will examine each of these analogies to help clarify the debate and elucidate how malware fits within the collecting mandate of archives, museums, and libraries

He goes on to demolish the ideas that malware is like dirt or mold. He provides several interesting real-world examples of archival workflows encountering malware. His eighth chapter Risk Assessment Considerations for Storage and Access is especially valuable in addressing the reasons why malware preservation is so controversial.

Overall, a very valuable contribution.

Tuesday, October 17, 2017

Will HAMR Happen?

For more than five years I've been skeptical of the storage industry's optimistic roadmaps in general, and in particular of the idea that HAMR (Heat Assisted Magnetic Recording) will replace the current PMR (Perpendicular Magnetic Recording) as the technology for hard disks any time soon. The first ship date for HAMR drives has been slipping in real time for nearly a decade, and last year Seagate slipped it again:

[Seagate] is targeting 2018 for HAMR drive deliveries, with a 16TB 3.5-inch drive planned, featuring 8 platters and 16 heads.

Now, Chris Mellor at The Register reports that:

WDC has given up on heat-assisted magnetic recording (HAMR) and is developing a microwave-assisted technique (MAMR) to push disk drive capacity up to 100TB by the 2030s.

It's able to do this with relatively incremental advances, avoiding the technological development barrier represented by HAMR. These developments include multi-stage head actuation and so-called Damascene head construction.

Below the fold, I assess this news.

Crowdfunding

ExoLife Finder

I've been a fairly enthusiastic crowdfunder for the past 5 years; I started with the Raspberry Pi. Most recently I backed the ExoLife Finder, a huge telescope using innovative technology intended to directly image the surfaces of nearby exoplanets. Below the fold, some of my history with crowdfunding to establish my credentials before I review some recent research on the subject.

IPRES 2017

Kyoto Railway Museum

Much as I love Kyoto, now that I'm retired with daily grandparent duties (and no-one to subsidize my travel) I couldn't attend iPRES 2017.

I have now managed to scan both the papers, and the very useful "collaborative notes" compiled by Micky Lindlar, Joshua Ng, William Kilbride, Euan Cochrane, Jaye Weatherburn and Rachel Tropea (thanks!). Below the fold I have some notes on the papers that caught my eye.

Living With Insecurity

My post Not Whether But When took off from the Equifax breach, attempting to explain why the Platonic ideal of a computer system storing data that is safe against loss or leakage cannot exist in the real world. Below the fold, I try to cover some of the implications of this fact.

OAIS & Distributed Digital Preservation

One of the lessons from the TRAC audit of the CLOCKSS Archive was the mis-match between the OAIS model and distributed digital preservation:

CLOCKSS has a centralized organization but a distributed implementation. Efforts are under way to reconcile the completely centralized OAIS model with the reality of distributed digital preservation, as for example in collaborations such as the MetaArchive and between the Royal and University Library in Copenhagen and the library of the University of Aarhus. Although the organization of the CLOCKSS Archive is centralized, serious digital archives like CLOCKSS require a distributed implementation, if only to achieve geographic redundancy. The OAIS model fails to deal with distribution even at the implementation level, let alone at the organizational level.

It is appropriate on the 19th anniversary of the LOCKSS Program to point to a 38-minute video about this issue, posted last month. In it Eld Zierau lays out the Outer OAIS - Inner OAIS model that she and Nancy McGovern have developed to resolve the mis-match, and published at iPRES 2014.

They apply OAIS hierarchically, first to the distributed preservation network as a whole (outer), and then to each node in the network (inner). This can be useful in delineating the functions of nodes as opposed to the network as a whole, and in identifying the single points of failure created by centralized functions of the network as a whole.

While I'm promoting videos, I should also point to Arquivo.pt's excellent video for a general audience about the importance of Web archiving, with subtitles in English.

Tuesday, October 3, 2017

Not Whether But When

Richard Smith, the CEO of Equifax while the company leaked personal information on most Americans (and suffered at least one more leak that was active for about a year up to last March) was held accountable for these failings by being allowed to retire with a mere $90M. But at Fortune, John Patrick Pullen quotes him as uttering an uncomfortable truth:

"There's those companies that have been breached and know it, and there are those companies that have been breached and don't know it,"

Pullen points out that:

The speech, given by Smith to students and faculty at the university's Terry College of Business, covered a lot of ground, but it frequently returned to security issues that kept the former CEO awake at night—foremost among them was the company's large database.

Smith should have been losing sleep:

Though it was still 21 days before his company would reveal that it had been massively hacked, Equifax, at that time, had been breached and knew it.

Two years ago, the amazing Maciej Cegłowski gave one of his barn-burning speeches, entitled Haunted by Data (my emphasis):

imagine data not as a pristine resource, but as a waste product, a bunch of radioactive, toxic sludge that we don’t know how to handle. In particular, I'd like to draw a parallel between what we're doing and nuclear energy, another technology whose beneficial uses we could never quite untangle from the harmful ones. A singular problem of nuclear power is that it generated deadly waste whose lifespan was far longer than the institutions we could build to guard it. Nuclear waste remains dangerous for many thousands of years. This oddity led to extreme solutions like 'put it all in a mountain' and 'put a scary sculpture on top of it' so that people don't dig it up and eat it. But we never did find a solution. We just keep this stuff in swimming pools or sitting around in barrels.

The fact is that, just like nuclear waste, we have never found a solution to the interconnected problems of keeping data stored in real-world computer systems safe from attack and safe from leaking. It isn't a question of whether the bad guys will get into the swimming pools and barrels of data, and exfiltrate it. It is simply when they will do so, and how long it will take you to find out that they have. Below the fold I look at the explanation for this fact. I'll get to the implications of our inability to maintain security in a subsequent post.

Web DRM Enables Innovative Business Model

Earlier this year I wrote at length about the looming disaster that was Web DRM, or the W3C's Encrypted Media Extensions (EME). Ten days ago, after unprecedented controversy, a narrow majority of W3C members made EME official.

So now I'm here to tell you the good news about how the combination of EME and the blockchain, today's sexiest technology, solves the most pressing issue for the Web, a sustainable business model. Innovators like the Pirate Bay and Showtime are already experimenting with it. They have yet to combine it with EME and gain the full benefit. Below the fold, I explain the details of this amazing new business opportunity. Be one of the first to effortlessly profit from the latest technology!

Sustaining Open Resources

Cambridge University Office of Scholarly Communication's Unlocking Research blog has an interesting trilogy of posts looking at the issue of how open access research resources can be sustained for the long term:

Dr. Lauren Cadwallader's Open Resources, who should pay
David Carr's Sustaining open research resources – a funder perspective
Dave Gerrard's Sustaining long-term access to open research resources – a university library perspective

Below the fold I summarize each of their arguments and make some overall observations.

Attacking (Users Of) The Wayback Machine

Right from the start, nearly two decades ago, the LOCKSS system assumed that:

Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove.

The LOCKSS polling and repair protocol was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.
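As a deliberately over-simplified illustration of how excess replication without a central authority can detect and repair changed content, consider the sketch below. It is not the LOCKSS protocol, which is considerably more elaborate; the function names and the strict-majority rule are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict

def content_hash(content: bytes) -> str:
    """Digest used to compare replicas without exchanging the content itself."""
    return hashlib.sha256(content).hexdigest()

def poll_and_repair(replicas: Dict[str, bytes]) -> Dict[str, bytes]:
    """Hypothetical poll: each peer votes with the hash of its copy.

    If a strict majority agrees, dissenting peers are repaired from an
    agreeing copy. With no majority, nothing is repaired and an error is
    raised, because no copy can be trusted over the others.
    """
    votes = Counter(content_hash(c) for c in replicas.values())
    winning_hash, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        raise ValueError("no majority among replicas; raise an alarm instead of repairing")
    good_copy = next(c for c in replicas.values() if content_hash(c) == winning_hash)
    return {peer: good_copy for peer in replicas}

# Illustrative use: one of five peers holds a tampered copy.
network = {
    "peer1": b"original article",
    "peer2": b"original article",
    "peer3": b"original article",
    "peer4": b"rewritten history",
    "peer5": b"original article",
}
repaired = poll_and_repair(network)
```

Even in this toy version the point is visible: an attacker has to change a majority of independently held copies, rather than subvert one central store.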

Just like libraries, Web archives have enemies. Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference identified seven potential vulnerabilities of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.

Now, Rewriting History: Changing the Archived Web from the Present by Ada Lerner et al (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to view pages that look different to the originally archived content. It is important to observe that the title is misleading, and that these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable; loss or damage to archived content is not.

Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.

The Internet of Things is Haunted by Demons

This is just a quick note to get you to read Cory Doctorow's Demon-Haunted World. We all know that the Internet of Things is infested with bugs that cannot be exterminated. That's not what Doctorow is writing about. He is focused on the non-bug software in the Things that makes them do what their manufacturer wants, not what the customer who believes they own the Thing wants.

Long-Lived Scientific Observations

By BabelStone, CC BY-SA 3.0
Source

Keeping scientific data, especially observations that are not repeatable, for the long term is important. In our 2006 Eurosys paper we used an example from China. During the Shang dynasty:

astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.

Last week we had another, if only one-fifth as old, example of the value of long-ago scientific observations. Korean astronomers' records of a nova in 1437 provide strong evidence that:

1437 nova remains

"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.

How were these 580-year-old records preserved? Follow me below the fold.

Josh Marshall on Google

Just a quick note to direct you to Josh Marshall's must-read A Serf on Google's Farm. It is a deep dive into the details of the relationship between Talking Points Memo, a fairly successful independent news publisher, and Google. It is essential reading for anyone trying to understand the business of publishing on the Web. Below the fold, pointers to a couple of other important works in this area.

Don't own cryptocurrencies

A year ago I ended a post entitled The 120K BTC Heist:

So in practice blockchains are decentralized (not), anonymous (not and not), immutable (not), secure (not), fast (not) and cheap (not). What's (not) to like?

Below the fold, I update the answer to the question with news you can use if you're a cryptocurrency owner.

Thursday, August 24, 2017

Why Is The Web "Centralized"?

There is a groundswell of opinion, which I share, in favor of a "decentralized Web" that has continued after last year's "Decentralized Web Summit". A wealth of different technologies for implementing a decentralized Web are competing for attention. But the basic protocols of the Internet and the Web (IP, TCP, DNS, HTTP, ...) aren't centralized. What is the centralization that decentralized Web advocates are reacting against? Clearly, it is the domination of the Web by the FANG (Facebook, Amazon, Netflix, Google) and a few other large companies such as the cable oligopoly.

These companies came to dominate the Web for economic not technological reasons. The Web, like other technology markets, has very large increasing returns to scale (network effects, duh!). These companies build centralized systems using technology that isn't inherently centralized but which has increasing returns to scale. It is the increasing returns to scale that drive the centralization.

Unless decentralized technologies specifically address the issue of how to avoid increasing returns to scale they will not, of themselves, fix this economic problem. Their increasing returns to scale will drive layering centralized businesses on top of decentralized infrastructure, replicating the problem we face now, just on different infrastructure.

Tuesday, August 22, 2017

Economic Model of Long-Term Storage

Cost vs. Kryder rate

As I wrote last month in Patting Myself On The Back, I started working on economic models of long-term storage six years ago. I got a small amount of funding from the Library of Congress; when that ran out I transferred the work to students at UC Santa Cruz's Storage Systems Research Center. This work was published here in 2012 and in later papers (see here).

What I wanted was a rough-and-ready Web page that would allow interested people to play "what if" games. What the students wanted was something academically respectable enough to get them credit. So the models accumulated lots of interesting details.

But the details weren't actually useful. The extra realism they provided was swamped by the uncertainty from the "known unknowns" of the future Kryder and interest rates. So I never got the rough-and-ready Web page. Below the fold, I bring the story up-to-date and point to a little Web site that may be useful.

Approaching The Physical Limits

As storage media technology gets closer and closer to the physical limits, progress on reducing the $/GB number slows down. Below the fold, a recap of some of these issues for both disk and flash.

Preservation Is Not A Technical Problem

As I've always said, preserving the Web and other digital content for posterity is an economic problem. With an unlimited budget, collection and preservation aren't a problem. The reason we're collecting and preserving less than half the classic Web of quasi-static linked documents, and much less of "Web 2.0", is that no-one has the money to do much better.

The budgets of libraries and archives, the institutions tasked with acting as society's memory, have been under sustained attack for a long time. I'm working on a talk and I needed an example. So I drew this graph of the British Library's annual income in real terms (year 2000 pounds). It shows that the Library's income has declined by almost 45% in the last decade.

Memory institutions that can purchase only half what they could 10 years ago aren't likely to greatly increase funding for acquiring new stuff; it's going to be hard for them just to keep the stuff (and the staff) they already have.

Below the fold, the data for the graph and links to the sources.

Disk media market update

It's time for an update on the disk media market, based on reporting from The Register's Chris Mellor here and here and here.

Decentralized Long-Term Preservation

Lambert Heller is correct to point out that:

name allocation using IPFS or a blockchain is not necessarily linked to the guarantee of permanent availability, the latter must be offered as a separate service.

Storage isn't free, and thus the "separate services" need to have a viable business model. I have demonstrated that increasing returns to scale mean that the "separate service" market will end up being dominated by a few large providers just as, for example, the Bitcoin mining market is. People who don't like this conclusion often argue that, at least for long-term preservation of scholarly resources, the service will be provided by a consortium of libraries, museums and archives. Below the fold I look into how this might work.

Initial Coin Offerings

The FT's Alphaville blog has started a new series, called ICOmedy, looking at the insanity surrounding Initial Coin Offerings (ICOs). The blockchain hype has created an even bigger opportunity to separate the fools from their money than the dot-com era did. To motivate you to follow the series, below the fold there are some extracts and related links.

Patting Myself On The Back

Cost vs. Kryder rate

I started working on economic models of long-term storage six years ago, and quickly discovered the effect shown in this graph. It plots the endowment, the money which, deposited with the data and invested at interest, pays for the data to be stored "forever", as a function of the Kryder rate, the rate at which $/GB drops with time. As the rate slows below about 20%, the endowment needed rises rapidly. Back in early 2011 it was widely believed that 30-40% Kryder rates were a law of nature; they had been that way for 30 years. Thus, if you could afford to store data for the next few years, you could afford to store it forever.
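The shape of this effect can be reproduced with a toy calculation, far cruder than the models described here. It assumes the annual cost of storing the data falls by the Kryder rate each year, the endowment earns a fixed real interest rate, and "forever" is approximated by 100 years. The 2% interest rate and the function name are assumptions for illustration only.

```python
def endowment(kryder_rate: float,
              interest_rate: float = 0.02,
              first_year_cost: float = 1.0,
              years: int = 100) -> float:
    """Present value of storing data for `years`, in units of the first year's cost.

    Year t costs first_year_cost * (1 - kryder_rate)**t, discounted back to
    today by (1 + interest_rate)**t. Hypothetical toy model, not the
    published one.
    """
    ratio = (1.0 - kryder_rate) / (1.0 + interest_rate)
    return sum(first_year_cost * ratio ** t for t in range(years))

# Endowment needed, as a multiple of the first year's cost, for a range of Kryder rates.
table = {rate: endowment(rate) for rate in (0.40, 0.30, 0.20, 0.10, 0.05)}
```

In this sketch the endowment at a 40% Kryder rate is only a little more than two years' worth of the initial cost, which is the "afford the next few years, afford forever" intuition. As the rate falls toward 10% and 5% the multiple climbs steeply.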

2014 cost/byte projection

As it turned out, 2011 was a good time to work on this issue. That October floods in Thailand destroyed 40% of the world's disk manufacturing capacity, and disk prices spiked. Preeti Gupta at UC Santa Cruz reviewed disk pricing in 2014 and we produced this graph. I wrote at the time:

The red lines are projections at the industry roadmap's 20% and a less optimistic 10%. [The graph] shows three things:

The slowing started in 2010, before the floods hit Thailand.

Disk storage costs in 2014, two and a half years after the floods, were more than 7 times higher than they would have been had Kryder's Law continued at its usual pace from 2010, as shown by the green line.

If the industry projections pan out, as shown by the red lines, by 2020 disk costs per byte will be between 130 and 300 times higher than they would have been had Kryder's Law continued.

Backblaze average $/GB

Thanks to Backblaze's admirable transparency, we have 3 years more data. Their blog reports on their view of disk pricing as a bulk purchaser over many years. It is far more detailed than the data Preeti was able to work with. Eyeballing the graph, we see a 2013 price around 5c/GB and a 2017 price around half that. A 10% Kryder rate would have meant a 2017 price of 3.2c/GB, and a 20% rate would have meant 2c/GB, so the out-turn lies between the two red lines on our graph. It is difficult to make predictions, especially about the future. But Preeti and I nailed this one.
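The arithmetic behind that comparison is simple enough to check. The sketch below takes the eyeballed 2013 and 2017 prices from the paragraph above as inputs; the function names are hypothetical, and compounding annually over four years is an assumption.

```python
def projected_price(start_price, kryder_rate, years):
    """Price after `years` of annual declines at `kryder_rate`."""
    return start_price * (1.0 - kryder_rate) ** years


def implied_rate(start_price, end_price, years):
    """Constant annual rate of decline that takes start_price to end_price."""
    return 1.0 - (end_price / start_price) ** (1.0 / years)


if __name__ == "__main__":
    price_2013 = 5.0   # cents/GB, eyeballed from the Backblaze graph
    price_2017 = 2.5   # cents/GB, "around half that"
    for rate in (0.10, 0.20):
        print(f"{rate:.0%} Kryder rate: {projected_price(price_2013, rate, 4):.2f} c/GB in 2017")
    print(f"Implied Kryder rate: {implied_rate(price_2013, price_2017, 4):.1%}")
```

With these inputs the two projections come out at roughly 3.3c/GB and 2.0c/GB, and halving the price over four years implies a Kryder rate of about 16%, between the two red lines.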

This is a big deal. As I've said many times:

Storage will be
Much less free
Than it used to be

The real cost of a commitment to store data for the long term is much greater than most people believe, and there is no realistic prospect of a technological discontinuity that would change this.

Tuesday, July 11, 2017

Is Decentralized Storage Sustainable?

There are many reasons to dislike centralized storage services. They include business risk, as we see in le petit musée des projets Google abandonnés, monoculture vulnerability and rent extraction. There is thus naturally a lot of enthusiasm for decentralized storage systems, such as MaidSafe, DAT and IPFS. In 2013 I wrote about one of their advantages in Moving vs. Copying. Among the enthusiasts is Lambert Heller. Since I posted Blockchain as the Infrastructure for Science, Heller and I have been talking past each other. Heller is talking technology; I have some problems with the technology but they aren't that important. My main problem is an economic one that applies to decentralized storage irrespective of the details of the technology.

Below the fold is an attempt to clarify my argument. It is a re-statement of part of the argument in my 2014 post Economies of Scale in Peer-to-Peer Networks, specifically in the context of decentralized storage networks.

Archive vs. Ransomware

Archives perennially ask the question "how few copies can we get away with?"

This is a question I've blogged about in 2016 and 2011 and 2010, when I concluded:

The number of copies needed cannot be discussed except in the context of a specific threat model.

The important threats are not amenable to quantitative modeling.

Defense against the important threats requires many more copies than against the simple threats, to allow for the "anonymity of crowds".

I've also written before about the immensely profitable business of ransomware. Recent events, such as WannaCrypt, NotPetya and the details of NSA's ability to infect air-gapped computers should convince anyone that ransomware is a threat to which archives are exposed. Below the fold I look into how archives should be designed to resist this credible threat.

"to promote the progress of useful Arts"

This is just a quick note to say that anyone who believes the current patent and copyright systems are working "to promote the progress of useful Arts" needs to watch Bunnie Huang's talk to the Stanford EE380 course, and read Bunnie's book The Hardware Hacker. Below the fold, a brief explanation.

Wall Street Journal vs. Google

After we worked together at Sun Microsystems, Chuck McManis worked at Google then built another search engine (Blekko). His contribution to the discussion on Dave Farber's IP list about the argument between the Wall Street Journal and Google is very informative. Chuck gave me permission to quote liberally from it in the discussion below the fold.

WAC2017: Security Issues for Web Archives

Jack Cushman and Ilya Kreymer's Web Archiving Conference talk Thinking like a hacker: Security Considerations for High-Fidelity Web Archives is very important. They discuss 7 different security threats specific to Web archives:

Archiving local server files
Hacking the headless browser
Stealing user secrets during capture
Cross site scripting to steal archive logins
Live web leakage on playback
Show different page contents when archived
Banner spoofing

Below the fold, a brief summary of each to encourage you to do two things:

First, view the slides.
Second, visit http://warc.games, which is a sandbox with a local version of Webrecorder that has not been patched to fix known exploits, and a number of challenges for you to learn how they might apply to web archives in general.

Analysis of Sci-Hub Downloads

Bastian Greshake has a post at the LSE's Impact of Social Sciences blog based on his F1000Research paper Looking into Pandora's Box. In them he reports on an analysis combining two datasets released by Alexandra Elbakyan:

A 2016 dataset of 28M downloads from Sci-Hub between September 2015 and February 2016.
A 2017 dataset of 62M DOIs to whose content Sci-Hub claims to be able to provide access.

Below the fold, some extracts and commentary.

Emulation: Windows10 on ARM

At last December's WinHEC conference, Qualcomm and Microsoft made an announcement to which I should have paid more attention:

Qualcomm ... announced that they are collaborating with Microsoft Corp. to enable Windows 10 on mobile computing devices powered by next-generation Qualcomm® Snapdragon™ processors, enabling mobile, power efficient, always-connected cellular PC devices. Supporting full compatibility with the Windows 10 ecosystem, the Snapdragon processor is designed to enable Windows hardware developers to create next generation device form factors, providing mobility to cloud computing.

The part I didn't think about was:

New Windows 10 PCs powered by Snapdragon can be designed to support x86 Win32 and universal Windows apps, including Adobe Photoshop, Microsoft Office and Windows 10 gaming titles.

How do they do that? The answer is obvious: emulation! Below the fold, some thoughts.

Crowd-sourced Peer Review

At Ars Technica, Chris Lee's Journal tries crowdsourcing peer reviews, sees excellent results takes off from a column at Nature by a journal editor, Benjamin List, entitled Crowd-based peer review can be good and fast. List and his assistant Denis Höfler have come up with a pre-publication peer-review process that, while retaining what they see as its advantages, has some of the attributes of post-publication review as practiced, for example, by Faculty of 1000. See also here. Below the fold, some commentary.

Public Resource Audits Scholarly Literature

I (from personal experience), and others, have commented previously on the way journals paywall articles based on spurious claims that they own the copyright, even when there is clear evidence that they know that these claims are false. This is copyfraud, but:

While falsely claiming copyright is technically a criminal offense under the Act, prosecutions are extremely rare. These circumstances have produced fraud on an untold scale, with millions of works in the public domain deemed copyrighted, and countless dollars paid out every year in licensing fees to make copies that could be made for free.

The clearest case of journal copyfraud is when journals claim copyright on articles authored by US federal employees:

Work by officers and employees of the government as part of their official duties is "a work of the United States government" and, as such, is not entitled to domestic copyright protection under U.S. law. So, inside the US there is no copyright to transfer, and outside the US the copyright is owned by the US government, not by the employee. It is easy to find papers that apparently violate this, such as James Hansen et al's Global Temperature Change. It carries the statement "© 2006 by The National Academy of Sciences of the USA" and states Hansen's affiliation as "National Aeronautics and Space Administration Goddard Institute for Space Studies".

Perhaps the most compelling instance is the AMA falsely claiming to own the copyright on United States Health Care Reform: Progress to Date and Next Steps by one Barack Obama.

Now, Carl Malamud tweets:

Public Resource has been conducting an intensive audit of the scholarly literature. We have focused on works of the U.S. government. Our audit has determined that 1,264,429 journal articles authored by federal employees or officers are potentially void of copyright.

They extracted metadata from Sci-Hub and found:

Of the 1,264,429 government journal articles I have metadata for, I am now able to access 1,141,505 files (90.2%) for potential release.

This is already extremely valuable work. But in addition:

2,031,359 of the articles in my possession are dated 1923 or earlier. These 2 categories represent 4.92% of scihub. Additional categories to examine include lapsed copyright registrations, open access that is not, and author-retained copyrights.

It is long past time for action against the rampant copyfraud by academic journals.

Tip of the hat to James R. Jacobs.

Tuesday, May 30, 2017

Blockchain as the Infrastructure for Science? (updated)

Herbert Van de Sompel pointed me to Lambert Heller's How P2P and blockchains make it easier to work with scientific objects – three hypotheses as an example of the persistent enthusiasm for these technologies as a way of communicating and preserving research, among other things. Another link from Herbert, Chris H. J. Hartgerink's Re-envisioning a future in scholarly communication from this year's IFLA conference, proposes something similar:

Distributing and decentralizing the scholarly communications system is achievable with peer-to-peer (p2p) Internet protocols such as dat and ipfs. Simply put, such p2p networks securely send information across a network of peers but are resilient to nodes being removed or adjusted because they operate in a mesh network. For example, if 20 peers have file X, removing one peer does not affect the availability of the file X. Only if all 20 are removed from the network, file X will become unavailable. Vice versa, if more peers on the network have file X, it is less likely that file X will become unavailable. As such, this would include unlimited redistribution in the scholarly communication system by default, instead of limited redistribution due to copyright as it is now.

I first expressed skepticism about this idea three years ago discussing a paper proposing a P2P storage infrastructure called Permacoin. It hasn't taken over the world. [Update: my fellow Sun Microsystems alum Radia Perlman has a broader skeptical look at blockchain technology. I've appended some details.]

I understand the theoretical advantages of peer-to-peer (P2P) technology. But after nearly two decades researching, designing, building, deploying and operating P2P systems I have learned a lot about how hard it is for these theoretical advantages actually to be obtained at scale, in the real world, for the long term. Below the fold, I try to apply these lessons.

I'm Doing It Wrong (personal)

Retirement, that is. Nearly six months into retirement from Stanford, I can see my initial visions of kicking back in a La-Z-Boy with time to read and think were unrealistic.

First, until you've been through it, you have no idea how much paperwork getting to be retired involves. I'm still working on it. Second, I'm still involved in the on-going evolution of the LOCKSS architecture, and I'm now working with the Internet Archive to re-think the economic model of long-term storage I built with students at UC Santa Cruz. Third, I have travel coming up, intermittently sick grandkids, and a lot of sysadmin debt built up over the years on our home network. I haven't even started on the mess in the garage.

This is just a series of feeble excuses for why there will be, as Atrios likes to say, "extra sucky blogging" for the next month or so. Sorry about that.

Thursday, May 18, 2017

"Privacy is dead, get over it" [updated]

I believe it was in 1999 that Scott McNealy famously said "privacy is dead, get over it". It is a whole lot deader now than it was then. A month ago in Researcher Privacy I discussed Sam Kome's CNI talk about the surveillance abilities of institutional network technology such as central wireless and access proxies. There's so much more to report on privacy that below the fold there can't be more than some suggested recent readings, as an update to my 6-month old post Open Access and Surveillance. [See a major update at the end]

Another Class of Blockchain Vulnerabilities

For at least three years I've been pointing out a fundamental problem with blockchain systems, and indeed peer-to-peer (P2P) systems in general, which is that maintaining their decentralized nature in the face of economies of scale (network effects, Metcalfe's Law, ...) is pretty close to impossible. I wrote a detailed analysis of this issue in Economies of Scale in Peer-to-Peer Networks. Centralized P2P systems, in which a significant minority (or in the case of Bitcoin an actual majority) can act in coordination perhaps because they are conspiring together, are vulnerable to many attacks. This was a theme of our SOSP "Best Paper" winner in 2003.

Now, Catalin Cimpanu at Bleeping Computer reports on research showing yet another way in which P2P networks can become vulnerable through centralization driven by economies of scale. Below the fold, some details.

Tape is "archive heroin"

I've been boring my blog readers for years with my skeptical take on quasi-immortal media. Among the many, many reasons why long media life, such as claimed for tape, is irrelevant to practical digital preservation is that investing in long media life is a bet against technological progress.

Now, at IEEE Spectrum, Marty Perlmutter's The Lost Picture Show: Hollywood Archivists Can’t Outpace Obsolescence is a great explanation of why tape's media longevity is irrelevant to long-term storage:

While LTO is not as long-lived as polyester film stock, which can last for a century or more in a cold, dry environment, it’s still pretty good.

The problem with LTO is obsolescence. Since the beginning, the technology has been on a Moore’s Law–like march that has resulted in a doubling in tape storage densities every 18 to 24 months. As each new generation of LTO comes to market, an older generation of LTO becomes obsolete. LTO manufacturers guarantee at most two generations of backward compatibility. What that means for film archivists with perhaps tens of thousands of LTO tapes on hand is that every few years they must invest millions of dollars in the latest format of tapes and drives and then migrate all the data on their older tapes—or risk losing access to the information altogether.

That costly, self-perpetuating cycle of data migration is why Dino Everett, film archivist for the University of Southern California, calls LTO “archive heroin—the first taste doesn’t cost much, but once you start, you can’t stop. And the habit is expensive.” As a result, Everett adds, a great deal of film and TV content that was “born digital,” even work that is only a few years old, now faces rapid extinction and, in the worst case, oblivion.

Note also that the required migration consumes a lot of bandwidth, meaning that in order to supply the bandwidth needed to ingest the incoming data you need a lot more drives. This reduces the tape/drive ratio, and thus decreases tape's apparent cost advantage. Not to mention that migrating data from tape to tape is far less automated and thus far more expensive than migrating between on-line media such as disk.

Tuesday, May 2, 2017

Distill: Is This What Journals Should Look Like?

A month ago a post on the Y Combinator blog announced that they and Google have launched a new academic journal called Distill. Except this is no ordinary journal consisting of slightly enhanced PDFs, it is a big step towards the way academic communication should work in the Web era:

The web has been around for almost 30 years. But you wouldn’t know it if you looked at most academic journals. They’re stuck in the early 1900s. PDFs are not an exciting form.

Distill is taking the web seriously. A Distill article (at least in its ideal, aspirational form) isn’t just a paper. It’s an interactive medium that lets users – “readers” is no longer sufficient – work directly with machine learning models.

Below the fold, I take a close look at one of the early articles to assess how big a step this is.

A decade of blogging

A decade ago today I posted Mass-market scholarly communication to start this blog. Now, 459 posts later I would like to thank everyone who has read and especially those who have commented on it.

Blogging is useful to me for several reasons:

It forces me to think through issues.
It prevents me forgetting what I thought when I thought through an issue.
It's a much more effective way to communicate with others in the same field than publishing papers.
Since I'm not climbing the academic ladder there's not much incentive for me to publish papers anyway, although I have published quite a few since I started LOCKSS.
I've given quite a few talks too. Since I started posting the text of a talk with links to the sources it has become clear that it is much more useful to readers than posting the slides.
I use the comments as a handy way to record relevant links, and why I thought they were relevant.

There weren't a lot of posts until 2011, when I started to target one post a week. I thought it would be hard to come up with enough topics, but pretty soon afterwards half-completed or note-form drafts started accumulating. My posting rate has accelerated smoothly since, and most weeks now get two posts. Despite this, I have more drafts lying around than ever.

Wednesday, April 19, 2017

Emularity strikes again!

The Internet Archive's massive collection of software now includes an in-browser emulation in the Emularity framework of the original Mac with MacOS from 1984 to 1989, and a Mac Plus with MacOS 7.0.1 from 1991. Shaun Nichols at The Register reports that:

The emulator itself is powered by a version of Hampa Hug's PCE Apple emulator ported to run in browsers via JavaScript by James Friend. PCE and PCE.js have been around for a number of years; now that tech has been married to the Internet Archive's vault of software.

Congratulations to Jason Scott and the software archiving team!

Thursday, April 13, 2017

Bufferbloat

This is just a brief note to point out that, after a long hiatus, my friend Jim Gettys has returned to blogging with Home products that fix/mitigate bufferbloat, an invaluable guide to products that incorporate some of the very impressive work undertaken by the bufferbloat project, CeroWrt, and the LEDE WiFi driver. The queuing problems underlying bufferbloat, the "lag" that gamers complain about and other performance issues at the edge of the Internet can make home Internet use really miserable. It has taken appallingly long for the home router industry to start shipping products with even the initial fixes released years ago. But a trickle of products is now available, and it is a great service for Jim to point at them.

Wednesday, April 12, 2017

Identifiers: A Double-Edged Sword

This is the last of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy, Research Access for the 21st Century, and The Orphans of Scholarship.

Geoff Bilder's Open Persistent Identifier Infrastructures: The Key to Scaling Mandate Auditing and Assessment Exercises was ostensibly a report on the need for and progress in bringing together the many disparate identifier systems for organizations in order to facilitate auditing and assessment processes. It was actually an insightful rant about how these processes were corrupting the research ecosystem. Below the fold, I summarize Geoff's argument (I hope Geoff will correct me if I misrepresent him) and rant back.

The Orphans of Scholarship

This is the third of my posts from CNI's Spring 2017 Membership Meeting. Predecessors are Researcher Privacy and Research Access for the 21st Century.

Herbert Van de Sompel, Michael Nelson and Martin Klein's To the Rescue of the Orphans of Scholarly Communication reported on an important Mellon-funded project to investigate how all the parts of a research effort that appear on the Web other than the eventual article might be collected for preservation using Web archiving technologies. Below the fold, a summary of the 67-slide deck and some commentary.

Research Access for the 21st Century

This is the second of my posts from CNI's Spring 2017 Membership Meeting. The first is Researcher Privacy.

Resource Access for the 21st Century, RA21 Update: Pilots Advance to Improve Authentication and Authorization for Content by Elsevier's Chris Shillum and Ann Gabriel reported on the effort by the oligopoly publishers to replace IP address authorization with Shibboleth. Below the fold, some commentary.

Researcher Privacy

The blog post I was drafting about the sessions I found interesting at the CNI Spring 2017 Membership Meeting got too long, so I am dividing it into a post per interesting session. First up, below the fold, perhaps the most useful breakout session. Sam Kome's Protect Researcher Privacy in the Surveillance Era, an updated version of his talk at the 2016 ALA meeting, led to animated discussion.

EU report on Open Access

The EU's ambitious effort to provide immediate open access to scientific publications as the default by 2020 continues with the publication of Towards a competitive and sustainable open access publishing market in Europe, a report commissioned by the OpenAIRE 2020 project. It contains a lot of useful information and analysis, and concludes that:

Without intervention, immediate OA to just half of Europe's scientific publications will not be achieved until 2025 or later.

The report:

considers the economic factors contributing to the current state of the open access publishing market, and evaluates the potential for European policymakers to enhance market competition and sustainability in parallel to increasing access.

Below the fold, some quotes, comments, and an assessment.

Threats to stored data

Recently there's been a lively series of exchanges on the pasig-discuss mail list, sparked by an inquiry from Jeanne Kramer-Smyth of the World Bank as to any additional risks posed by media such as disks that did encryption or compression. It morphed into discussion of the "how many copies" question and related issues. Below the fold, my reflections on the discussion.

The Amnesiac Civilization: Part 5

Part 2 and Part 3 of this series established that, for technical, legal and economic reasons there is much Web content that cannot be ingested and preserved by Web archives. Part 4 established that there is much Web content that can currently be ingested and preserved by public Web archives that, in the near future, will become inaccessible. It will be subject to Digital Rights Management (DRM) technologies which will, at least in most countries, be illegal to defeat. Below the fold I look at ways, albeit unsatisfactory, to address these problems.

The Amnesiac Civilization: Part 4

Part 2 and Part 3 of this series covered the unsatisfactory current state of Web archiving. Part 1 of this series briefly outlined the way the W3C's Encrypted Media Extensions (EME) threaten to make this state far worse. Below the fold I expand on the details of this threat.

SHA1 is dead

On February 23rd a team from CWI Amsterdam (where I worked in 1982) and Google Research published The first collision for full SHA-1, marking the "death of SHA-1". Using about 6500 CPU-years and 110 GPU-years, they created two different PDF files with the same SHA-1 hash. SHA-1 is widely used in digital preservation, among many other areas, despite having been deprecated by NIST through a process starting in 2005 and becoming official by 2012.

There is an accessible report on this paper by Dan Goodin at Ars Technica. These collisions have already caused trouble for systems in the field, for example for Webkit's Subversion repository. Subversion and other systems use SHA-1 to deduplicate content; files with the same SHA-1 are assumed to be identical. Below the fold, I look at the implications for digital preservation.
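To see why a collision breaks deduplication, consider a minimal sketch of a content-addressed store. It is not how Subversion or any particular system is implemented; the class name `DedupStore`, the default of SHA-256 and the byte-for-byte check on a digest match are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store that keeps one blob per digest."""

    def __init__(self, algorithm="sha256"):
        self.algorithm = algorithm   # e.g. "sha1" or "sha256"
        self.blobs = {}              # hex digest -> stored bytes

    def put(self, data: bytes) -> str:
        key = hashlib.new(self.algorithm, data).hexdigest()
        existing = self.blobs.get(key)
        if existing is None:
            self.blobs[key] = data
        elif existing != data:
            # Same digest, different bytes: a collision. A store that skipped
            # this comparison would silently keep only the first file.
            raise ValueError(f"{self.algorithm} collision on digest {key}")
        return key


if __name__ == "__main__":
    store = DedupStore()
    first = store.put(b"the same content")
    second = store.put(b"the same content")
    print(first == second, len(store.blobs))
```

A store that trusts the digest alone, as the paragraph above describes, has no `elif` branch: the second of two colliding files is treated as a duplicate of the first and its content is never stored.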

The Amnesiac Civilization: Part 3

In Part 2 of this series I criticized Kalev Leetaru's Are Web Archives Failing The Modern Web: Video, Social Media, Dynamic Pages and The Mobile Web for failing to take into account the cost of doing a better job. Below the fold I ask whether, even with unlimited funds, it would actually be possible to satisfy Leetaru's reasonable-sounding requirements, and whether those requirements would actually solve the problems of Web archiving.

Dr. Pangloss and Data in DNA

Last night I gave a 10-minute talk at the Storage Valley Supper Club, an event much beloved of the good Dr. Pangloss. The title was DNA as a Storage Medium; it was a slightly edited section of The Medium-Term Prospects for Long-Term Storage Systems. Below the fold, an edited text with links to the sources.

The Amnesiac Civilization: Part 2

Part 1 of The Amnesiac Civilization predicted that the state of Web archiving would soon get much worse. How bad is it right now, and why? Follow me below the fold for Part 2 of the series. I'm planning at least three more parts:

Part 3 will assess how practical some suggested improvements might be.
Part 4 will look in some detail at the Web DRM problem introduced in Part 1.
Part 5 will discuss a "counsel of despair" approach that I've hinted at in the past.

The Amnesiac Civilization: Part 1

Those who cannot remember the past are condemned to repeat it
George Santayana: Life of Reason, Reason in Common Sense (1905)

Who controls the past controls the future. Who controls the present controls the past.
George Orwell: Nineteen Eighty-Four (1949)

Santayana and Orwell correctly perceived that societies in which the past is obscure or malleable are very convenient for ruling elites and very unpleasant for the rest of us. It is at least arguable that the root cause of the recent inconveniences visited upon ruling elites in countries such as the US and the UK was inadequate history management. Too much of the population correctly remembered a time in which GDP, the stock market and bankers' salaries were lower, but their lives were less stressful and more enjoyable.

Two things have become evident over the past couple of decades:

The Web is the medium that records our civilization.
The Web is becoming increasingly difficult to collect and preserve in order that the future will remember its past correctly.

This is the first in a series of posts on this issue. I start by predicting that the problem is about to get much, much worse. Future posts will look at the technical and business aspects of current and future Web archiving. This post is shorter than usual to focus attention on what I believe is an important message.

In a 2014 post entitled The Half-Empty Archive I wrote, almost as a throw-away:

The W3C's mandating of DRM for HTML5 means that the ingest cost for much of the Web's content will become infinite. It simply won't be legal to ingest it.

The link was to a post by Cory Doctorow in which he wrote:

We are Huxleying ourselves into the full Orwell.

He clearly understood some aspects of the problem caused by DRM on the Web:

Everyone in the browser world is convinced that not supporting Netflix will lead to total marginalization, and Netflix demands that computers be designed to keep secrets from, and disobey, their owners (so that you can’t save streams to disk in the clear).

Two recent developments got me thinking about this more deeply, and I realized that neither I nor, I believe, Doctorow comprehended the scale of the looming disaster. It isn't just about video and the security of your browser, important as those are. Here it is in as small a nutshell as I can devise.

Almost all the Web content that encodes our history is supported by one or both of two business models: subscription, or advertising. Currently, neither model works well. Web DRM will be perceived as the answer to both. Subscription content, not just video but newspapers and academic journals, will be DRM-ed to force readers to subscribe. Advertisers will insist that the sites they support DRM their content to prevent readers running ad-blockers. DRM-ed content cannot be archived.

Imagine a world in which archives contain no subscription and no advertiser-supported content of any kind.

Update: the succeeding posts in the series are:

Notes from FAST17

As usual, I attended Usenix's File and Storage Technologies conference. Below the fold, my comments on the presentations I found interesting.

Injecting Faults in Distributed Storage

I'll record my reactions to some of the papers at the 2017 FAST conference in a subsequent post. But one of them has significant implications for digital preservation systems using distributed storage, and deserves a post to itself. Follow me below the fold as I try to draw out these implications.

Bundled APCs Considered Harmful

The publishers ... are pushing bundled APCs to librarians as a way to retain the ability to extract monopoly rents. As the Library Loon perceptively points out:

The key aspect of Elsevier’s business model that it will do its level best to retain in any acquisitions or service launches is the disconnect between service users and service purchasers.

I just realized that there is another pernicious aspect of bundled APC (Article Processing Charges) deals such as the recent deal between the Gates Foundation and AAAS. It isn't just that the deal involves Gates paying over the odds. It is that AAAS gets the money without necessarily publishing any articles. This gives them a financial incentive to reject Gates-funded articles, which would take up space in the journal for which AAAS could otherwise charge an APC.

Thursday, February 23, 2017

Poynder on the Open Access mess

Do not be put off by the fact that it is 36 pages long. Richard Poynder's Copyright: the immoveable barrier that open access advocates underestimated is a must-read. Every one of the 36 pages is full of insight.

Briefly, Poynder is arguing that the mis-match of resources, expertise and motivation makes it futile to depend on a transaction between an author and a publisher to provide useful open access to scientific articles. As I have argued before, Poynder concludes that the only way out is for Universities to act:

As it happens, the much-lauded Harvard open access policy contains the seeds for such a development. This includes wording along the lines of: “each faculty member grants to the school a nonexclusive copyright for all of his/her scholarly articles.” A rational next step would be for schools to appropriate faculty copyright all together. This would be a way of preventing publishers from doing so, and it would have the added benefit of avoiding the legal uncertainty some see in the Harvard policies. Importantly, it would be a top-down diktat rather than a bottom-up approach. Since currently researchers can request a no-questions-asked opt-out, and publishers have learned that they can bully researchers into requesting that opt-out, the objective of the Harvard OA policies is in any case subverted.

Note the word "faculty" above. Poynder does not examine the issue that very few papers are published all of whose authors are faculty. Most authors are students, post-docs or staff. The copyright in a joint work is held by the authors jointly, or if some are employees working for hire, jointly by the faculty authors and the institution. I doubt very much that the copyright transfer agreements in these cases are actually valid, because they have been signed only by the primary author (most frequently not a faculty member), and/or have been signed by a worker-for-hire who does not in fact own the copyright.

Thursday, February 16, 2017

Postel's Law again

Eight years ago I wrote:

In RFC 793 (1981) the late, great Jon Postel laid down one of the basic design principles of the Internet, Postel's Law or the Robustness Principle:

"Be conservative in what you do; be liberal in what you accept from others."
It's important not to lose sight of the fact that digital preservation is on the "accept" side of Postel's Law.

Recently, discussion on a mailing list I'm on focused on the downsides of Postel's Law. Below the fold, I try to explain why most of these downsides don't apply to the "accept" side, which is the side that matters for digital preservation.
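
To make the two sides of Postel's Law concrete, here is a minimal sketch in Python. It is an illustration only: the function names parse_header and format_header are hypothetical and come from neither RFC 793 nor any preservation system. The parser is the "accept" side, tolerating odd case and stray whitespace, while the formatter is the "do" side, emitting exactly one canonical form.

```python
from typing import Tuple


def parse_header(line: str) -> Tuple[str, str]:
    """Accept side: tolerate odd case, stray whitespace and any line ending."""
    name, sep, value = line.partition(":")
    if not sep:
        # Even a liberal parser has to reject input it cannot interpret at all.
        raise ValueError(f"not a header line: {line!r}")
    return name.strip().lower(), value.strip()


def format_header(name: str, value: str) -> str:
    """Do side: always emit a single canonical form."""
    canonical = "-".join(part.capitalize() for part in name.split("-"))
    return f"{canonical}: {value}\r\n"


if __name__ == "__main__":
    name, value = parse_header("  content-TYPE :  text/html \n")
    print(repr(format_header(name, value)))
```

An archive is in the position of parse_header: it has no control over what publishers emit, so it has to cope with whatever it finds.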

RFC 4810

A decade ago next month, Wallace et al. published RFC 4810, Long-Term Archive Service Requirements. Its abstract is:

There are many scenarios in which users must be able to prove the existence of data at a specific point in time and be able to demonstrate the integrity of data since that time, even when the duration from time of existence to time of demonstration spans a large period of time. Additionally, users must be able to verify signatures on digitally signed data many years after the generation of the signature. This document describes a class of long-term archive services to support such scenarios and the technical requirements for interacting with such services.

Below the fold, a look at how it has stood the test of time.
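
As a rough illustration of the integrity half of that abstract, here is a minimal Python sketch. It is not the service RFC 4810 describes: the names EvidenceRecord, record_evidence and verify_integrity are hypothetical, the timestamp is simply asserted by the caller, and nothing is signed. A real long-term archive service would need trusted timestamps, signatures and a way to renew evidence as algorithms weaken.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical record kept alongside archived data."""
    digest_hex: str
    algorithm: str
    recorded_at: str  # ISO 8601 string, asserted by the caller and not trusted


def record_evidence(data: bytes, recorded_at: str) -> EvidenceRecord:
    """Compute a SHA-256 digest of the data at the time it is archived."""
    return EvidenceRecord(hashlib.sha256(data).hexdigest(), "sha256", recorded_at)


def verify_integrity(data: bytes, record: EvidenceRecord) -> bool:
    """Recompute the digest later and compare it with the recorded one."""
    return hashlib.new(record.algorithm, data).hexdigest() == record.digest_hex


if __name__ == "__main__":
    original = b"archived content"
    record = record_evidence(original, "2017-01-01T00:00:00Z")
    print(verify_integrity(original, record))
    print(verify_integrity(b"tampered content", record))
```

Storing the algorithm name in the record is a deliberate choice in this sketch: over "a large period of time" the hash function itself may need to be replaced, so the verifier must know which one was used.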
