Thursday, December 12, 2013

UK National Archive

Joe Fay at The Register has an interesting piece about a tour of the UK National Archive.

The archive has an excellent and comprehensive approach to preserving the UK government's Web presence:
It uses a crawler to trawl the UK government’s web estate, aiming to hit sites every six months. With the government looking to shutter many obscure or unloved sites, the pressure is on. The web archive currently stands at around 80TB, with the crawler pulling in 1.6TB a month. At time of writing, there are 3 billion URLs in the archive, with 1 billion captured last year alone.

But does anyone really care? Seems like they do. Espley said the archive gets around 15 to 20 million page views a month. This often maps to current events - the assumption being that visitors are often cross-checking current government positions/statements against previous positions.
One must hope that the cross-checking doesn't turn up anything embarrassing enough to imperil the Archive's budget ...

Friday, December 6, 2013

Diminishing Returns

One of the reasons for the slowing of Kryder's Law has been that the investment needed to get successive generations of disk technology into the market has been increasing. Assuming per-drive costs and volumes are approximately stable, this means that a technology generation has to stay in the market longer to recoup its development costs. Thus, even if the proportional density increase in each generation is the same, because the generations are spaced further apart, the result is a slower Kryder's Law.
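The arithmetic behind this is worth making concrete. A sketch (the 2x density step and the generation spacings are illustrative numbers, not industry data):

```python
# Annualized Kryder rate implied by a fixed density step per technology
# generation: the same step spread over more years means a slower rate.
def annual_kryder_rate(step: float, years_between_generations: float) -> float:
    return step ** (1 / years_between_generations) - 1

print(annual_kryder_rate(2.0, 2))  # doubling every 2 years: ~41%/yr
print(annual_kryder_rate(2.0, 4))  # doubling every 4 years: ~19%/yr
```

Stretching the gap between generations from two years to four cuts the annual density growth rate by more than half, even though each generation delivers the same proportional improvement.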

Henry Samueli, CTO of Broadcom, makes the same point about Moore's Law. As the feature size of successive chip generations decreases, the cost of the manufacturing technology increases. And the techniques needed, such as FinFET and other 3D technologies, also slow down and increase the cost of using the manufacturing technology:
Process nodes themselves still have room to advance, but they may also be headed for a wall in about 15 years, Samueli said. After another three generations or so, chips will probably reach 5nm, and at that point there will be only 10 atoms from the beginning to the end of each transistor gate, he said. Beyond that, further advances may be impossible.
"You can't build a transistor with one atom," Samueli said. There's no obvious path forward at that point, either. "As of yet, we have not seen a viable replacement for the CMOS transistor as we've known it for the last 50 years."
... the ongoing bargain of getting more for less eventually will end, Samueli said. "We've been spoiled by these devices getting cheaper and cheaper and cheaper in every generation. We're just going to have to live with prices leveling off," he said.
Both of these are simply applications of the Law of Diminishing Returns.

Wednesday, December 4, 2013

The Memory Hole

Peter van Buren understands the 1984 analogy that drove us to assume a very powerful adversary when we designed the LOCKSS system a decade and a half ago.

Tuesday, November 26, 2013

In-browser emulation

Jeff Rothenberg's ground-breaking 1995 article Ensuring the Longevity of Digital Documents described and compared two techniques to combat format obsolescence: format migration and emulation, concluding that emulation was the preferred approach. As time went by and successive digital preservation systems went into production, it became clear that almost all of them rejected Jeff's conclusion, planning to use format migration as their preferred response to format obsolescence. Follow me below the fold for a discussion of why this happened and whether it still makes sense.

Wednesday, November 20, 2013

Patio Perspectives at ANADP II: Preserving the Other Half

Vicky Reich and I moderated a session at ANADP II entitled Patio Perspectives Session 2: New Models of Collaborative Preservation. The abstract for the session said:
This session will explore how well current preservation models keep our evolving scholarly communication products accessible for the short and long term. Library and publisher practices are changing in response to scholars' needs and market constraints. Where are the holes in our current approaches, and how can they be filled? Or are completely new models required?
I gave a brief introductory talk; an edited text with links to the sources is below the fold.

Thursday, November 14, 2013

Estimating Storage Costs

Ethan Miller points me to a paper on the cost of storage, How Much Does Storage Really Cost? Towards a Full Cost Accounting Model for Data Storage by Amit Kumar Dutta and Ragib Hasan (DH) of the University of Alabama at Birmingham. Unfortunately, the conference at which it was presented, GECON 2013, is one of those whose proceedings are published in Springer's awful Lecture Notes in Computer Science series, so no link. Below the fold, discussion of the relationship between DH and our on-going work on the economics of long-term storage.

Tuesday, November 12, 2013

The Bitcoin vulnerability

Last month I wrote a ten-year retrospective of some of the ideas underlying the LOCKSS anti-entropy protocol in our SOSP paper, relating them to recent work on securing SSL communications. This month Ittay Eyal and Emin Gun Sirer (ES) published an important paper describing a vulnerability in Bitcoin. There are two similarities between this attack and the stealth modification attack we examined in that paper:
  • The attack involves a conspiracy in which the members strategically switch between good and bad behavior. The defense involves randomizing the behavior of the peers. The general lesson is that predictable behavior by honest peers is often easy to exploit.
  • The attack involves deploying an army of Sybil peers that appear legitimate but are actually under the control of the conspiracy. The defense involves making peer operations expensive using a proof-of-work technique. The general lesson is that peer reputations cheaply acquired are worth what they cost.
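The second lesson can be illustrated with the standard hashcash-style puzzle that underlies Bitcoin's proof-of-work; this sketch is my own illustration, not code from either Bitcoin or LOCKSS:

```python
import hashlib
import itertools

def proof_of_work(data: bytes, difficulty_bits: int) -> int:
    """Find a nonce such that SHA-256(data || nonce) has at least
    `difficulty_bits` leading zero bits. Expected cost doubles with
    each extra bit, which is what makes Sybil identities expensive
    to mint while remaining cheap to verify."""
    target = 1 << (256 - difficulty_bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

nonce = proof_of_work(b"peer-identity", 16)  # ~2**16 hashes on average
```

The asymmetry is the point: the prover grinds through thousands of hashes, but anyone can check the result with a single hash.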
Follow me below the fold for the details.

Wednesday, November 6, 2013

Fire at Internet Archive

A side building at the Internet Archive used for book-scanning was consumed by fire last night. The people, the data and the library are safe but the Internet Archive is asking for donations to help them rebuild. If you can afford to, please help; I just did.

Update: they need to replace an estimated $600K in scanning equipment, plus rebuild the building.

Tuesday, November 5, 2013

Cloud lock-in

Back in June I used the demise of Google Reader to list a number of business issues with using third-party cloud storage services for long-term digital preservation. Scott Gilbertson was one of the users who were left high and dry. He has an interesting piece at The Register about the process of recovering from the loss of Reader. He starts from the well-known but very apt quote:
If you're not paying for something, you're not the customer; you're the product being sold.
then points out that:
Just because you are paying companies like Google, Apple or Microsoft you might feel they are, somehow, beholden to you. The companies are actually beholden only to their stockholders, whose interests may or may not be aligned with your own, so will change services accordingly.
and, after pointing out how easy it is these days for users to run cloud-like services for themselves, ends up concluding:
If you aren't hosting your data, it's not your data.
Also, Joe McKendrick at ZDnet pointed me to the Open Group's interesting Cloud Computing Portability and Interoperability. Joe introduces it by saying:
Along with security, one of the most difficult issues with cloud platforms is the risk of vendor lock-in. By assigning business processes and data to cloud service providers, it may get really messy and expensive to attempt to dislodge from the arrangement if it's time to make a change.
The guide, compiled by a team led by Kapil Bakshi and Mark Skilton, provides key pointers for enterprises seeking to develop independently functioning clouds, as well as recommendations to the industry on standards that need to be adopted or extended.
It is mainly about avoiding getting locked in to a vendor of cloud computing services rather than cloud storage services, so its focus is on open, standard interfaces to such services. But the main message of both pieces is that any time you are using cloud services, you need an up-to-date, fully costed exit strategy. Trying to come up with an exit strategy when you're given 13 days' notice that you need one is guaranteed to be an expensive disaster.

Wednesday, October 30, 2013

Seagate's Ethernet Hard Drives

A week ago Seagate made an extraordinarily interesting announcement for long-term storage, their Kinetic Open Storage Platform, including disk drives with Ethernet connectivity. Below the fold, the details.

Wednesday, October 23, 2013

Trust in Computer Systems

Ten years ago our paper Preserving Peer Replicas by Rate-Limited Sampled Voting was awarded "Best Paper" at the 2003 ACM Symposium on Operating System Principles. It was later expanded into a paper in ACM Transactions on Computing Systems, The LOCKSS peer-to-peer digital preservation system. According to Google Scholar, the twin papers have accumulated over 300 citations. Below the fold I discuss the ideas about trust we put forward in these papers that have turned out to have lasting significance, and are particularly important today.

Tuesday, October 15, 2013

Special Issue of Science

I'd like to draw attention to the special issue of Science on scientific communication, which the AAAS has made freely available. It is all worth reading but these pieces caught my eye:

Monday, October 14, 2013

The Major Threat is Economic

Annalee Newitz reviews an essay by Heather Phillips entitled "The Great Library at Alexandria?". She makes two points about an institution often cited in discussions of digital preservation that have particular resonance at a time when the US government is mostly shut down, and may be forced into default.

First, contrary to popular myth, it appears the library was not destroyed overnight by fire, but decayed slowly over a long period of time as its initially lavish budget was repeatedly cut. Phillips writes:
Though it seems fitting that the destruction of so mythic an institution as the Great Library of Alexandria must have required some cataclysmic event . . . in reality, the fortunes of the Great Library waxed and waned with those of Alexandria itself. Much of its downfall was gradual, often bureaucratic, and by comparison to our cultural imaginings, somewhat petty.
As I've frequently said, the biggest threat to the long-term survival of digital materials is economic. This isn't something new.

Second, the importance of the library was not its collection, but the synergy between its collection and the scholars it attracted. Newitz writes:
What made the Museum and its daughter branch great were its scholars. And when the Emperor abolished their stipends, and forbade foreign scholars from coming to the library, he effectively shut down operations. Those scrolls and books were nothing without people to care for them, study them, and share what they learned far and wide.
What matters isn't the perfection of a collection, but the usefulness of a collection. Digital preservation purists may scorn the Internet Archive, but as I write this post Alexa ranks it the 167th most used site on the Internet. For comparison, the Library of Congress is currently the 4,212th ranked site (and is up despite the shutdown), the Bibliothèque Nationale de France is ranked 16,274 and the British Library is ranked 29,498. Little-used collections, such as dark archives, post-cancellation-only archives, and access-restricted copyright deposit collections are all at much greater economic risk in the long term than widely used sites such as the Internet Archive.

Of course, many of the important (and thus well-used) works from the Library of Alexandria survived because their importance meant that there were lots of copies. Newitz writes:
Even this account of the burning has to be taken with a grain of salt. The first stories of it appear hundreds of years after the events that took place, and historians aren't sure whether it's accurate. Canfora also notes that by the time this alleged destruction took place, the men who cared for the library were aware that many of its important works were in circulation elsewhere in the world. Major centers of learning had been established in India and Central Asia, along the great Silk Road, where nomadic scholars wandered between temples that were stocked with books.

Tuesday, October 8, 2013

Hybrid Disk Drives

Dave Anderson gave an interesting talk at the Library of Congress' Designing Storage Architectures meeting on Seagate's enterprise hybrid disk drives. Details below the fold.

Friday, October 4, 2013

It was fifteen years ago today

Fifteen years ago today Vicky Reich and I were hiking the Cañada de Pala trail at Joseph D. Grant County Park when we came up with the idea for the LOCKSS technology. The next day we pitched the idea to Michael Keller, the Stanford Librarian, and got permission to start the project. As I recall, Michael told us:
  • Don't cost me any money.
  • Don't get me in to trouble.
  • Do what you like.
You can't ask better than that. The name for the project came later, on a rather muddy hike to Berry Creek Falls in Big Basin Redwoods State Park.

Tuesday, October 1, 2013

End-of-life in the Cloud

The Register reports on the demise of Nirvanix, an enterprise cloud storage startup. Nirvanix customers were told on Sept. 18:
Customers had to get all their data out by the end of September or, in effect, face losing it.
They had 13 days to do it. Below the fold I ask what would happen if Amazon made a similar announcement about S3 - not because I think that is possible but to show how impossible it is.

Tuesday, September 24, 2013

Panel at Library of Congress Storage Architectures meeting

Henry Newman and I ran a panel at the Library of Congress' Storage Architectures meeting entitled Cloud Challenges. Below the fold is the text of my brief presentation, entitled Cloud Services: Caveat Emptor, with links to the sources.

Monday, September 23, 2013

Worth reading

Below the fold, quick comments on two good reads.

Tuesday, September 10, 2013

Becoming a better scientist

I'm really interested in the work the Force11 group and many others are doing to apply the techniques of software engineering and digital preservation to making science more reproducible and re-usable. Unfortunately, work for our recent Mellon Foundation grant and for the TRAC audit of the CLOCKSS Archive has meant I've been too busy to contribute or even pay much attention. But a discussion that broke out on the Force11 mailing list sparked by Paul Groth pointing to a post on his blog called Becoming a better scientist (reproducibility edition) really grabbed my attention. Follow me below the fold for the details.

Friday, September 6, 2013

"Preservation at Scale" at iPRES2013

I took part in the Preservation at Scale (PDF) workshop at iPRES2013. Below the fold is an edited text of my presentation, entitled Diversity and Risk at Scale, with links to the sources.

Thursday, September 5, 2013

Noteworthy papers at iPRES2013

Below the fold I discuss some papers from iPres2013 that I found particularly interesting.

Tuesday, September 3, 2013

Talk for "RDF Vocabulary Preservation" at iPres2013

The group planning a session on "RDF Vocabulary Preservation" at iPRES2013 asked me to give a brief presentation on the principles behind the LOCKSS technology. Below the fold is an edited text with links to the sources.

Tuesday, August 27, 2013

More on storing "all that stuff"

In a post last May I expressed skepticism about the claims that the organizations on the dark side could store yottabytes of data, for example at the Utah data center. I wasn't alone; here for example from June is Mark Burnett on the same theme. The skeptical chorus has had some effect; the Wikipedia article has been edited to remove this hyperbolic claim:
a data storage facility for the United States Intelligence Community that is designed to be a primary storage resource capable of storing data on the scale of yottabytes.
In this clip from a NOVA documentary from last January entitled "Rise of the Drones", Yannis Antoniades of BAE Systems discusses the Argus camera used for drone surveillance. The video claims:
Argus streams live to the ground and also stores everything, a million terabytes of video a day, ...
Below the fold, let's look at this seemingly innocuous claim.

Tuesday, August 20, 2013


Caroline O'Donovan at the Nieman Journalism Lab has an interesting article entitled Exegesis: How early adapters, innovative publishers, legacy media companies and more are pushing toward the annotated web. She discusses the way media sites including The New York Times, The Financial Times, Quartz and SoundCloud, and platforms such as Medium, are trying to evolve from comments to annotations as a way to improve engagement with their readers. She also describes work under way to build annotations into the Web infrastructure, and points to an interesting blog post from Peter Brantley on a workshop with journalists. Below the fold, some thoughts on the implications for preserving the Web.

Tuesday, August 13, 2013

Winston Smith Lives!

Three years ago I wrote a post on the importance of a tamper-resistant system for government documents, and another a year ago. Governments cannot resist the temptation to re-write history to their advantage, and every so often they get caught, which is an excuse for me to repeat the message. Below the fold, this year's version of the message.

Tuesday, July 30, 2013

's Plan Z

One of the big worries the evolution of the Web has been posing for preservation is the spread of URL shortening services. A failure of one of these services would break a vast number of preserved links. The more polite of the two descriptions of the problem is:
The URLTeam is the ArchiveTeam subcommittee on URL shorteners. We believe that they pose a serious threat to the internet's integrity. If one of them dies, gets hacked or sells out, millions of links will stop working.
In a fascinating keynote at Digital Preservation 2013, Hilary Mason dropped a hint to ask her after the talk about's Plan Z. So we did. It turns out that Plan Z is doing the right thing, for themselves and to some extent for the world.

Every time shortens a URL, they write a static redirect from the short to the long URL into Amazon's S3. Thus, if their service ever goes down, as for example it did when someone unplugged their data center, all they need to do is to update the DNS record for to point to S3, and the short URLs continue to resolve as they used to.
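The scheme is simple enough to sketch. Everything here (the in-memory dict standing in for the S3 bucket, the function names) is my illustration, not's actual implementation:

```python
# Plan Z as described above: every shorten also writes a static
# redirect object, so resolution can survive the service itself dying.
redirect_store = {}  # stands in for the S3 bucket of static objects

def shorten(short_code: str, long_url: str) -> None:
    # The live service would also record analytics etc.; for survival,
    # only this static 301 object matters.
    redirect_store[short_code] = {
        "status": 301,
        "headers": {"Location": long_url},
    }

def resolve(short_code: str) -> str:
    # What the static hosting does once DNS points at it: no code,
    # just replay the stored redirect.
    obj = redirect_store[short_code]
    return obj["headers"]["Location"]

shorten("abc123", "")
```

The design choice worth noting is that the redirect objects are written at shorten time, not exported after a failure, so the fallback needs no working database to be useful.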

This is fine for itself and it is a big step towards being fine for the world. Of course, if were to go out of business, they would stop paying the S3 charges and for the domain name, and their Plan Z wouldn't help the Web survive intact.

Archive Team has a group called URLTeam that works to back up URL shortening services, compiling a map from short to long URL for each and exposing the result as a torrent that can be downloaded and preserved. This is great, as it gets the data out of the custody of the service but, again, it doesn't on its own solve the problem of keeping the links intact after the service implodes.

If we can get it deployed, Memento is the missing piece of the solution. It is designed to ensure that links resolve even after the original target has gone away. As far as Memento is concerned, a shortened URL is no different from any other URL. If, after the URL shortening service dies, one or more of the backups can provide a site like Plan Z's that exports the static redirects and supports Memento then browsers that support Memento will continue to see the shortened URLs resolve as they originally did.
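For concreteness, here is the shape of the datetime negotiation Memento (RFC 7089) uses; this constructs only the client-side header, with no network I/O, and the example date is arbitrary:

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# A Memento client asks a TimeGate for the capture nearest a given
# moment by sending an Accept-Datetime header in RFC 1123 format;
# the TimeGate answers with a redirect to the best Memento it holds.
def accept_datetime_header(when: datetime) -> dict:
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}

hdrs = accept_datetime_header(datetime(2013, 7, 30, tzinfo=timezone.utc))
print(hdrs["Accept-Datetime"])  # Tue, 30 Jul 2013 00:00:00 GMT
```

Since a shortened URL is just a URL, the same negotiation that finds an archived page can find an archived redirect.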

Wednesday, July 24, 2013

Talk at Digital Preservation 2013

I was on a panel at the Library of Congress' Digital Preservation 2013 meeting entitled Green Bytes: Sustainable Approaches to Digital Stewardship. Below the fold is the text of my brief presentation, with links to the sources.

Tuesday, July 16, 2013

Immortal media

Announcements of media technologies with very long lives happen regularly. In 2011 it was stone DVDs, in 2012 it was DNA, and now it is 5-dimensional quartz DVDs from a team at Southampton University and Eindhoven. Note that in two presentations at the IEEE Mass Storage meeting back in April, Hitachi announced that they already have a similar but lower-density DVD technology. Below the fold, my reactions.

Tuesday, July 9, 2013

The Library of Congress' "Preserving.exe" meeting

In May the Library of Congress held a meeting entitled Preserving.exe: Toward a National Strategy for Preserving Software. I couldn't be there and I've only just managed to read the presentations and other materials. Three quick reactions below the fold.

Wednesday, July 3, 2013

Dr. Pangloss and the Road-Maps

The renowned Dr. Pangloss takes great pleasure in studying the storage industry's road-maps with their rosy view of the future. I've frequently pointed to this 2008 Seagate road-map for disk technology, showing Perpendicular Magnetic Recording (PMR) being supplanted by Heat Assisted Magnetic Recording (HAMR) starting in about 2009. It's 5 years later and no vendor has yet shipped HAMR drives, although HAMR has been demonstrated in the lab at over a trillion bits per square inch, about a 30% improvement over the best current PMR. This illustrates that these vendor road-maps tend to err on the optimistic side. Dr. Pangloss is rubbing his hands with glee at the vendors' latest road-maps; below the fold I look at why he is so happy.

Thursday, June 27, 2013

Economics of Evil

Back in March Google announced that this weekend is the end of Google Reader, the service many bloggers and journalists used to use to read online content via RSS. This wasn't the first service Google killed, but because the people who used the service write for the Web, the announcement sparked a lively discussion. Because many people believe that commercial content platforms and storage services will preserve digital content for the long term, the discussion below the fold should be of interest here.

Tuesday, June 25, 2013

The Big Deal

Andrew Odlyzko has a fascinating paper with a rather long title, Open Access, library and publisher competition, and the evolution of general commerce (PDF). He describes how the relationship between the libraries and the publishers in the market for academic journals has evolved to transfer resources from libraries to the publishers, and how a similar strategy might play out in many, more general markets. Below the fold I discuss some of the details, but you should read the whole thing.

Friday, June 21, 2013

Petabyte DVD?

Simon Sharwood at The Register points to a paper in Nature Communications (and a more readable explanation) by a team from Swinburne University of Technology that may eventually allow for a petabyte on a single DVD platter.

They have found a way around Abbe's limit, which restricts the width of a light beam to be more than half its wavelength. They use two beams, each of which on its own is more than half a wavelength wide. One is round, one is donut-shaped, and they overlap. Then, as with normal DVDs, they use a medium which contains a dye activated by the round beam. The secret is that the donut-shaped beam prevents the dye being activated. So the size of the written spot on the medium is the size of the hole in the donut, in their case only 9nm across. With 9nm dots it is in theory possible to get a petabyte on a DVD.

However, as the Library of Congress and others have observed, dye-based DVD media typically have a short data retention life even at current feature sizes, much larger than 9nm. So although the capacity of the DVDs the team envisages is impressive, they aren't likely to be much use for digital preservation.

Tuesday, June 18, 2013

Not trusting cloud storage

I'm trying to work through my stack of half-completed blog posts. Some months ago Jane Mandelbaum at the Library of Congress pointed me to Towards self-repairing replication-based storage systems using untrusted clouds by Bo Chen and Reza Curtmola. It received an “Outstanding Paper Award” at CODASPY 2013. Let me start by saying that the technique they describe is interesting and, as far as I can tell, represents an advance in some important respects on earlier work.

However, it is also an example, if not a severe one, of the problem I discussed in my post Journals Considered Harmful of authors hyping their work in order to be published in a higher-profile forum, and reviewers failing to catch this exaggeration. Follow me below the fold for the details.

Saturday, June 15, 2013

Cliff Lynch

Two quick plugs. The first for Mike Ashenfelder's profile of Cliff Lynch in the Library of Congress' Digital Preservation Pioneer series. Cliff has helped the LOCKSS Program in too many ways to count. Personally, I'm particularly grateful for his occasional invitations to speak to his class at UC Berkeley's School of Information. They have provided an essential spur to get me to pull my thoughts together in several important areas.

The second is Cliff's article on e-books for American Libraries recent e-book supplement. There is a lot to digest in it. I hope to return to some aspects in a later post, but his conclusion succinctly describes the threat to libraries:
If we have not come to reasonable terms about e-books, both the access and preservation functions of our libraries will be gravely threatened, and as a society, we will face a profound public policy problem. It is in everyone's interest, I believe, to avoid this crisis.

Thursday, June 13, 2013

Brief talk at ElPub 2013

I was on the panel entitled Setting Research Data Free: Problems and Solutions at the ElPub 2013 conference. Below the fold is the text of my introductory remarks with links to the sources.

Sunday, May 26, 2013

Maureen Pennock's "Web Archiving" report

Under the auspices of the Digital Preservation Coalition, Maureen Pennock has written a very comprehensive overview of Web Archiving. It is an excellent introduction to the field, and has a lot of useful references.

Thursday, May 23, 2013

How dense can storage get?

James Pitt has an interesting if not terribly useful post at Quora comparing the Bekenstein Bound, the absolute limit that physics places on the density of information, with Harvard's DNA storage experiment. He concludes:
the best DNA storage can do with those dimensions [a gram of dry DNA] is 5.6×10^15 bits.

A Bekenstein-bound storage device with those dimensions would store about 1.6×10^38 bits.
So, there is about a factor of 3×10^22 in bits/gram beyond DNA. He also compares the Bekenstein limit with Stanford's electronic quantum holography, which stored 35 bits per electron. A Bekenstein-limit device the size of an electron would store 6.6×10^7 bits, so there's plenty of headroom there too. How reliable storage media this dense would be, and what their I/O bandwidth would be, are open questions, especially since the limit describes the information density of a black hole.
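As a sanity check on Pitt's numbers, the bound I ≤ 2πRE/(ħc ln 2) with E = Mc² can be evaluated directly; the ~6mm radius for a gram of dry DNA is my assumption, chosen because it lands near the ~10^38-bit figure quoted above:

```python
import math

HBAR = 1.054571817e-34  # reduced Planck constant, J s
C = 2.99792458e8        # speed of light, m/s

def bekenstein_bits(mass_kg: float, radius_m: float) -> float:
    """Bekenstein bound on the information content of a sphere of the
    given mass and radius: I <= 2*pi*R*E / (hbar * c * ln 2), E = M*c^2."""
    energy = mass_kg * C ** 2
    return 2 * math.pi * radius_m * energy / (HBAR * C * math.log(2))

# One gram in a ~6 mm radius (assumed): roughly 1.5e38 bits.
print(f"{bekenstein_bits(1e-3, 6e-3):.2e}")
```

Note that the bound scales linearly with both mass and radius, so shrinking the same gram into a smaller volume lowers, not raises, the limit.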

Thursday, May 16, 2013

A sidelight on "A Petabyte for a Century"

In my various posts over the last six years on A Petabyte For A Century I made the case that the amounts of data and the time for which they needed to be kept had reached the scale at which the reliability needed was infeasible. I'm surprised that I don't seem to have referred to the parallel case being made in high-performance computing, most notably in a 2009 paper, Toward Exascale Resilience by Franck Cappello et al:
From the current knowledge and observations of existing large systems, it is anticipated that Exascale systems will experience various kind of faults many times per day. It is also anticipated that the current approach for resilience, which relies on automatic or application level checkpoint-restart, will not work because the time for checkpointing and restarting will exceed the mean time to failure of a full system.
Here is a fascinating presentation by Horst Simon of the Lawrence Berkeley Lab, who has bet against the existence of an Exaflop computer before 2020. He points out all sorts of difficulties in the way other than reliability, but the key slide is #35 which does include a mention of reliability. This slide makes the same case as Cappello et al on much broader arguments, namely that to get more than an order of magnitude or so beyond our current HPC technology will take a complete re-think of the programming paradigm. Among the features required of the new programming paradigm is a recognition that errors and failures are inevitable and there is no way for the hardware to cover them up. The same is true of storage.
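The checkpoint-restart argument can be quantified with Young's classic approximation for the optimal checkpoint interval; the 10-minute checkpoint cost and the failure rates below are illustrative assumptions, not measurements from any real machine:

```python
import math

def wasted_fraction(checkpoint_secs: float, mtbf_secs: float) -> float:
    """Approximate fraction of machine time lost to writing checkpoints
    plus expected rework after failures, when checkpointing at Young's
    near-optimal interval sqrt(2 * checkpoint_cost * MTBF)."""
    interval = math.sqrt(2 * checkpoint_secs * mtbf_secs)
    return checkpoint_secs / interval + interval / (2 * mtbf_secs)

# Petascale-ish: 10-minute checkpoints, one failure a day -> ~12% lost.
print(wasted_fraction(600.0, 86400.0))
# Exascale-ish: same checkpoints, a failure every 30 minutes -> ~80% lost.
print(wasted_fraction(600.0, 1800.0))
```

As the mean time between failures approaches the checkpoint time itself, the machine spends most of its cycles on resilience rather than work, which is exactly Cappello et al's point.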

Tuesday, May 14, 2013

The value that publishers add

Here is Paul Krugman pointing out how much better econoblogs are doing at connecting economics and policy than traditional publishing. He brings out several of the points I've been making since the start of this blog six years ago.

First, speed: 
The overall effect is that we’re having a conversation in which issues get hashed over with a cycle time of months or even weeks, not the years characteristic of conventional academic discourse.
Second, the corruption of the reviewing process:
In reality, while many referees do their best, many others have pet peeves and ideological biases that at best greatly delay the publication of important work and at worst make it almost impossible to publish in a refereed journal. ... anything bearing on the business cycle that has even a vaguely Keynesian feel can be counted on to encounter a very hostile reception; this creates some big problems of relevance for proper journal publication under current circumstances.
Third, reproducibility:
Look at one important recent case ... Alesina/Ardagna on expansionary austerity. Now, as it happens the original A/A paper was circulated through relatively “proper” channels: released as an NBER working paper, then published in a conference volume, which means that it was at least lightly refereed. ... And how did we find out that it was all wrong? First through critiques posted at the Roosevelt Institute, then through detailed analysis of cases by the IMF. The wonkosphere was a much better, much more reliable source of knowledge than the proper academic literature.
And here's yet another otherwise good review of the problems of scientific publishing that accepts Elsevier's claims as to the value they add, failing to point out the peer-reviewed research into peer review that conclusively refutes these claims. It does, however, include a rather nice piece of analysis from Deutsche Bank:
We believe [Elsevier] adds relatively little value to the publishing process. We are not attempting to dismiss what 7,000 people at [Elsevier] do for a living. We are simply observing that if the process really were as complex, costly and value-added as the publishers protest that it is, 40% margins wouldn’t be available.
As I pointed out using 2010 numbers:
The world's research and education budgets pay [Elsevier, Springer & Wiley] about $3.2B/yr for management, editorial and distribution services. Over and above that, the world's research and education budgets pay the shareholders of these three companies almost $1.5B for the privilege of reading the results of research (and writing and reviewing) that these budgets already paid for.
What this $4.7B/yr pays for is a system which encourages, and is riddled with, error and malfeasance. If these value-subtracted aspects were taken into account, it would be obvious that the self-interested claims of the publishers as to the value that they add were spurious.

Tuesday, May 7, 2013

Storing "all that stuff"

In two CNN interviews former FBI counter-terrorism specialist Tim Clemente attracted a lot of attention when he said:

"We certainly have ways in national security investigations to find out exactly what was said in that conversation. ... No, welcome to America. All of that stuff is being captured as we speak whether we know it or like it or not." and "all digital communications in the past" are recorded and stored
Many people assumed that the storage is in the Utah Data Center which, according to Wikipedia:
is a data storage facility for the United States Intelligence Community that is designed to be a primary storage resource capable of storing data on the scale of yottabytes
Whatever the wisdom of collecting everything, I'm a bit skeptical about the practicality of storing it. Follow me below the fold for a look at the numbers.
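As a taste of those numbers, here is a quick sanity check on the "yottabytes" claim (the 4TB-per-drive and $100-per-drive figures are my assumptions, roughly in line with 2013 commodity drives):

```python
# Rough sanity check on storing a yottabyte on hard disk.
# Drive capacity and price are assumptions for illustration only.
yottabyte = 1e24           # bytes
drive_capacity = 4e12      # bytes per drive (assumed 4TB, 2013-era)
drive_price = 100          # $ per drive (assumed)

drives = yottabyte / drive_capacity
cost = drives * drive_price
print(f"Drives needed: {drives:.1e}, media cost: ${cost:.1e}")
```

That comes to 2.5×10^11 drives and $2.5×10^13 in media alone, which should make clear why the yottabyte figure deserves skepticism.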

Monday, April 29, 2013

Talk on LOCKSS Metadata Extraction at IIPC 2013

I gave a brief introduction to the way the LOCKSS daemon extracts metadata from the content it collects at the 2013 IIPC General Assembly. Below the fold is an edited text with links to the sources.

Sunday, April 28, 2013

Talk on Harvesting the Future Web at IIPC2013

I gave a talk to introduce the workshop "Future Web Capture: Replay, Data Mining and Analysis" at the 2013 IIPC General Assembly. It was based on my talk at the Spring CNI meeting. Below the fold is an edited text with links to the sources.

Saturday, April 27, 2013

Software obsolescence doesn't imply format obsolescence

Tim Anderson at The Register celebrates the 20th anniversary of Mosaic:
Using the DOSBox emulator (the Megabuild version which has network connectivity via an emulated NE2000 NIC) I ran up Windows 3.11 with Trumpet Winsock and got Mosaic 1.0 running.
This illustrates two important points:
  • Tim had no trouble resuscitating a 20-year-old software environment using off-the-shelf emulation.
  • The 20-year-old browser struggled to make sense of today's web. But today's browsers have no difficulty at all with vintage web pages.
The fact that the software that originally interpreted the content is obsolete (a) does not mean that there is significant difficulty in running it, and (b) does not mean that you need to run it under emulation in order to interpret the content, because the obsolescence of the software does not imply the obsolescence of the format. Backwards compatibility is a feature of the Web, for reasons I have been pointing out for many years.

Thursday, April 25, 2013

Moore, Kryder vs. SAW

Ashish Sood et al's paper Predicting the Path of Technological Innovation: SAW vs. Moore, Bass, Gompertz, and Kryder is very interesting. They propose a discontinuous model in which technology evolves in steps, separated by periods of stasis they call waits, leading them to dub the model SAW (Step And Wait). They show that it models the evolution of a wide range of technologies better than continuous models such as Moore's and Kryder's laws. Our work on the economics of long-term storage is based on Kryder's law, a continuous model. Below the fold I ask whether we need to change models.
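To make the two families of model concrete, here is a toy sketch of a continuous Kryder's-Law cost curve against a Step-And-Wait curve (the parameters are illustrative only, not fitted to Sood et al's data):

```python
# Toy comparison of a continuous Kryder's-law model of $/GB with a
# Step-And-Wait (SAW) model. Parameters are purely illustrative.

def kryder(year, start=1.0, annual_drop=0.6):
    """Continuous model: cost falls by a constant factor every year."""
    return start * annual_drop ** year

def saw(year, start=1.0, step_drop=0.6 ** 2, wait=2):
    """SAW model: cost is flat during each 'wait', then steps down."""
    steps = year // wait
    return start * step_drop ** steps

for y in range(7):
    print(y, round(kryder(y), 4), round(saw(y), 4))
```

The two curves agree at the end of each wait but diverge in between; with longer waits the average rate of cost decrease falls even if each step is the same size, which is the question at issue for our economic models.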

Tuesday, April 23, 2013

Making Memento Successful

I gave a talk at the IIPC General Assembly on the problems facing Memento as it attempts the transition from a technology to a ubiquitous part of the Web's infrastructure. It was based on my earlier posts on Memento, my talk at the recent CNI and discussions with the Memento team, and intended to provide the background for subsequent talks from Herbert van de Sompel and Michael Nelson. Below the fold is an edited text with links to the sources.

Wednesday, April 17, 2013

It isn't just Kryder's Law

The drastic fall-off in PC shipments as demand switches to tablets isn't just affecting the prospects for Kryder's law reducing storage media costs, even for 2.5" drives:
PC sales are in terminal decline thanks to the continued popularity of tablets and there’s nothing an anticipated surge in ultramobiles can do to stop it.
Gartner has estimated that this year will see 2.4 billion devices shipped – that’s PCs, tablets and mobile phones combined – growing nine per cent over 2012.
The number of PCs sold in 2013 will fall 7.6 per cent compared to 2012, to 315 million units, with the only bright spot being ultramobiles, which will increase 140 per cent to 23 million units.
Tablet shipments will surge 69 per cent to 197 million units, while smartphones will make up an ever-increasing slice of the mobile phone pie. Of the 1.875 billion mobile phones Gartner predicts will be sold in 2013, a whopping 1 billion units are predicted to be smartphones, compared with 675 million units in 2012 (out of 1.746 billion).
It is also affecting the prospects for Moore's Law reducing the costs of the servers that drive the storage media:
But with memory prices stabilizing after years of double-digit drops, analysts said that DDR3 DRAM will likely have a longer-than-expected life, which could delay the wide adoption of DDR4 in computers. DRAM prices have stabilized as demand for DDR3 has exceeded supply, and the number of memory makers has also dwindled. ...
The volume shipments of PCs and servers are not enough to justify an early switch to DDR4, analysts said. Also, a lot of focus is now on the fast-growing tablet and smartphone markets, so manufacturers are shifting capacity to LPDDR3 and other forms of mobile memory and storage.

Monday, April 8, 2013

Thursday, April 4, 2013

Talk at Spring 2013 CNI

Kris Carpenter Negulescu and I gave talks at the Spring 2013 CNI meeting in a project briefing entitled "It's Not Your Grandfather's Web Any Longer". They were based on the workshop we ran at the 2012 IIPC meeting at the Library of Congress looking at the problems of harvesting and preserving the future Web. I talked about the problems the workshop identified and Kris talked about the solutions people are working on. Below the fold is an edited text of my part of the talk with links to the sources.

Tuesday, April 2, 2013

More on Amazon's Margins

I'm not the only one doing the math to show the extortionate margins Amazon enjoys on its S3 cloud storage business. Over at The Register Simon Sharwood uses an announcement about Amazon's Cloud Drive service and a comparison with the competing Dropbox service, which runs on S3, to draw the same conclusion. He shows that, unless either Amazon or Dropbox are losing money, S3's costs must be much less than 3.7c/GB/mo:
5000 terabytes is 5,120,000 gigabytes. At $0.037 a gigabyte a month, Dropbox would have a bill of $189,440 a month. At $9.99 a month for 100 gigabytes of data, Dropbox needs 18,963 paying customers to meet that bill. 18,963 times 100 gigabytes is 1,896,296, which leaves 3,223,704 gigabytes of space Dropbox can dole out to its non-paying users. That's not 96 per cent of the capacity it pays for, but given Dropbox's customers at all levels probably don't use all their capacity it's not hard to see how Dropbox could get mighty close to a profit even if it pays AWS' published price, which we can't imagine it does.
So is Amazon making a profit on Cloud Drive's paid plans? AWS' genesis as Amazon's private cloud means it is sensible to assume Cloud Drive runs on S3 or something an awful lot like it, charged back between business units at low, low, mi casa es su casa prices. That could mean AWS can operate cloud storage space rather more cheaply than its advertised rates and almost certainly more cheaply than it charges even colossal customers like Dropbox.
Go read the whole piece.
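Sharwood's arithmetic is easy to reproduce (the prices are the published figures quoted above; actual negotiated prices are unknown):

```python
# Reproducing Sharwood's back-of-the-envelope Dropbox/S3 numbers.
capacity_gb = 5000 * 1024      # 5000 terabytes = 5,120,000 gigabytes
s3_price = 0.037               # $/GB/month, AWS published price

monthly_bill = capacity_gb * s3_price
print(f"Monthly S3 bill: ${monthly_bill:,.0f}")          # $189,440

plan_price, plan_gb = 9.99, 100
payers_needed = monthly_bill / plan_price                # ~18,963 customers
free_gb = capacity_gb - payers_needed * plan_gb
print(f"Break-even customers: {payers_needed:,.0f}, "
      f"capacity left for free users: {free_gb:,.0f} GB")
```

The numbers match The Register's: fewer than 19,000 paying customers cover the whole bill, leaving over 3.2 million gigabytes for free users.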

Friday, March 29, 2013

Nature special issue on scientific publishing

Nature has a fascinating special issue on scientific publishing, which is well worth reading. In particular, the article Open Access: the true cost of science publishing by Richard Van Noorden has a lot of valuable information. The sub-head is:
Cheap open-access journals raise questions about the value publishers add for their money
There is one major problem with this otherwise excellent article. Although it allows Tim Gowers to raise the question in the sub-head:
The key question is whether the extra effort adds useful value, says Timothy Gowers, a mathematician at the University of Cambridge, UK, who last year led a revolt against Elsevier
the article treats this as a he-said, she-said controversy. It fails to point to any of the massive accumulation of peer-reviewed research that answers this key question by showing that only the most selective journals add positive (if barely detectable) value. If the greater levels of misrepresentation and fraud in these journals are accounted for, their value-add is most likely negative.

Publishers, particularly for-profit publishers, are unable to acknowledge that their vaunted and expensive processes are not adding value, and are thus not worth paying for. This is demonstrated by the fact that Nature finds itself unable to cite the literature answering the question that it raises. Isn't part of Nature's value-add supposed to be making sure that relevant literature is cited?

Tuesday, March 26, 2013

Preserving personal data

4-slot Drobo
I've just started using the first product from the latest company of someone for whom I have great respect, serial entrepreneur Geoff Barrall. His previous company was Data Robotics, now Drobo after their product. I made a small investment in the company and have been using Drobos ever since the initial beta program. Geoff's team managed the all-too-rare feat in the industry of packaging up complex technology, in this case RAID, in a form that is both highly effective and very easy to use. Drobos are a wonderful way of protecting your data against disk failures - over the years the three original 4-slot Drobos in my home rack have handled disks filling up and failing with complete composure. They are now maxed out with 2TB drives for a total of nearly 18TB of usable space; when this fills up I'll finally have to buy more units. Follow me below the fold for details on Geoff's new product.

Thursday, March 21, 2013

Report on Digital Preservation and Cloud Services

One of the most valuable aspects of the Library of Congress' National Digital Stewardship Alliance (NDSA) is that it provides a forum for sharing expertise and experience among institutions trying to preserve the nation's digital heritage. The latest example is the Report on Digital Preservation and Cloud Services (PDF) written by Instrumental for, and published by, the Minnesota Historical Society.  This is an excellent overview of the strategic and technical issues surrounding the potential use of a wide range of cloud services for preservation. I strongly recommend reading it.

Also, Rebecca Pool has a short piece on the same topic here, based in part on an interview she did with me some months ago.

Tuesday, March 12, 2013

Journals Considered Harmful

Via Yves Smith and mathbabe I found Deep Impact: Unintended consequences of journal rank by Björn Brembs and Marcus Munafò, which is a detailed analysis of the arguments I put forward at the Dagstuhl workshop on the Future of Research Communication and elsewhere.

The authors draw the following conclusions:
The current empirical literature on the effects of journal rank provides evidence supporting the following four conclusions: 1) Journal rank is a weak to moderate predictor of scientific impact; 2) Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability; 3) Journal rank is expensive, delays science and frustrates researchers; and, 4) Journal rank as established by [Impact Factor] violates even the most basic scientific standards, but predicts subjective judgments of journal quality.
Even if you disagree with their conclusions, their extensive bibliography is a valuable resource. Below the fold I discuss selected quotes from the paper.

Tuesday, March 5, 2013

Re-thinking Memento Aggregation

A bit more than two years ago in my second post about Memento I described some issues with its concept of Aggregators. These are the search-engine like services that guide browsers to preserved content. As part of the work we are doing to enhance the LOCKSS daemon software under a grant from the Mellon Foundation we have implemented the basic Memento mechanisms, so I'm now having to face some of these issues.

I have come to believe that the problems with the Aggregator concept are more fundamental than I originally described, and require a significant re-think. Below the fold I set out my view of the problems, and an outline of my proposed solution.

Tuesday, February 26, 2013

Facebook's "Cold Storage"

Last week Facebook announced they are building a couple of "cold storage" data centers:
Facebook will move older pictures and back-up photos to new, more energy-efficient data centers, called "cold storage" centers. ... the new "cold storage" centers, which are still under construction, will be five times more energy efficient and will allow users to access old images anytime without noticing any difference.
Facebook's problem is that they are ingesting 315M photos/day, or 7PB/month. Reducing the energy consumed by the backup copies and the older, less frequently accessed pictures is important. Although this is a work-in-progress and Facebook isn't talking about some details, it appears that among the techniques they are using are erasure coding, to operate with a lower replication factor, aggressively spinning down disks, using flash to hold indexes, and perhaps new, low-power drives such as these from Seagate, which claim 27% less power draw. They are working in the context of the Open Vault project, so this technology should eventually be available to others.
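To see why erasure coding matters for a cold-storage design, here is a toy comparison of storage overheads (the (10,4) Reed-Solomon parameters are my assumption for illustration; Facebook has not published the scheme they use here):

```python
# Toy comparison: raw storage overhead of replication vs erasure coding.
# The specific (10,4) parameters are illustrative, not Facebook's.

def replication_overhead(copies=3):
    """Triple replication stores 3 bytes for every byte of data."""
    return copies

def erasure_overhead(data_shards=10, parity_shards=4):
    """A (10,4) Reed-Solomon code survives the loss of any 4 shards
    while storing only 1.4 bytes per byte of data."""
    return (data_shards + parity_shards) / data_shards

print(replication_overhead())   # 3
print(erasure_overhead())       # 1.4
```

At 7PB/month of ingest, the difference between a 3x and a 1.4x overhead is a very large number of drives, and thus of watts.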

Thursday, February 21, 2013

Kai Li's FAST Keynote

Kai Li's keynote at the FAST 2013 conference was entitled Disruptive Innovation: Data Domain Experience. Data Domain was the pioneer of deduplication for backups. I was one of the people Sutter Hill asked to look at Data Domain when they were considering a B-round investment in 2003. I was very impressed, not just with their technology, but more with the way it was packaged as an appliance so that it was very easy to sell. The elevator pitch was "It is a box. You plug it into your network. Backups work better."
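The core idea of deduplication can be sketched in a few lines (this is a toy fixed-size-block version; Data Domain's production system is far more sophisticated, using variable-size content-defined chunking among other things):

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Toy dedup: store each distinct block once, keyed by its hash."""
    store = {}    # hash -> block, stored only once per distinct block
    recipe = []   # sequence of hashes needed to reconstruct the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)
        recipe.append(digest)
    return store, recipe

# A "backup" of 12KB containing mostly repeated content
backup = b"A" * 8192 + b"B" * 4096
store, recipe = dedupe(backup)
print(len(recipe), len(store))   # 3 blocks referenced, only 2 stored
```

Because successive backups of the same system share most of their blocks, the second and later backups store almost nothing new, which is why the appliance "just works better" for backup workloads.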

I loved Kai's talk. Not just because I had a small investment in the B round, so he made me money, but more because just about everything he said matched experiences I had at Sun or nVIDIA. Below the fold I discuss some of the details.

Tuesday, February 19, 2013

Thoughts from FAST 2013

I attended Usenix's 2013 FAST conference. I was so interested in Kai Li's keynote entitled Disruptive Innovation: Data Domain Experience that I'll devote a separate post to it. Below the fold are some other things that caught my attention. Thanks to Usenix's open access policy, you can follow the links and read the papers if I've piqued your interest.

Thursday, February 14, 2013

Amazon's margins

I've been blogging a lot about the economics of cloud storage, and always using Amazon as the comparison. I've been stressing that the margins on their cloud storage business are extortionate. But Amazon is famous for running on very low margins. Below the fold I look at this conundrum.

Tuesday, February 12, 2013

Rothenberg still wrong

Last March Jeff Rothenberg gave a keynote entitled Digital Preservation in Perspective: How far have we come, and what's next? to the Future Perfect 2012 conference at the wonderful, must-visit Te Papa Tongarewa museum in Wellington, New Zealand. The video is here. The talk only recently came to my attention, for which I apologize.

I have long argued, for example in my 2009 CNI keynote, that while Jeff correctly diagnosed the problems of digital preservation in the pre-Web era, the transition to the Web that started in the mid-90s made those problems largely irrelevant. Jeff's presentation is frustrating, in that it shows how little his thinking has evolved to grapple with the most significant problems facing digital preservation today. Below the fold is my critique of Jeff's keynote.

Thursday, January 31, 2013

DNA as a storage medium

I blogged last October about a paper from Harvard in Science describing using DNA as a digital storage medium. In a fascinating keynote at IDCC2013 Ewan Birney of EMBL discussed a paper in Nature with a much more comprehensive look at this technology. It has been getting a lot of press, much of it as usual somewhat misleading. Below the fold I delve into the details.

Tuesday, January 29, 2013

DAWN vs. Twitter

I blogged three weeks ago about the Library of Congress ingesting the Twitter feed, noting that the tweets were ending up on tape. The collection is over 130TB and growing by 190GB/day. The Library is still trying to work out how to provide access to it; for example, they cannot afford the infrastructure that would allow readers to perform keyword searches. This leaves the 400-odd researchers who have already expressed a need for access to the collection stymied. The British Library is also running into problems providing access to large collections, although none as large as Twitter. They are reduced to delivering 30TB NAS boxes to researchers, the same approach Amazon and other services have taken to moving large amounts of data.

I mentioned this problem in passing in my earlier post, but I have come to understand that this observation has major implications for the future of digital preservation. Follow me below the fold as I discuss them.

Tuesday, January 22, 2013

Talk at IDCC2013

At IDCC2013 in Amsterdam I presented the paper Distributed Digital Preservation in the Cloud in which Daniel Vargas and I described an experiment in which we ran a LOCKSS box in Amazon's cloud. Or rather, I gave a talk that briefly motivated and summarized the paper and then focused on subsequent developments in cloud storage services, such as Glacier. Below the fold is an edited text of the talk with links to the resources. I believe that video of the talk (and, I hope, the interesting question-and-answer session that followed) will be made available eventually.

Friday, January 18, 2013

Podcast interview from Fall CNI 2012

Following on from my talk at the 2012 Fall CNI meeting on 11th December, Gerry Bayne interviewed me about the economics of using cloud services for preservation. The edited 12-minute MP3 has been posted on the Educause website. I think I did a pretty good job of explaining the fundamental business reasons why institutions are going to continue to waste large amounts of money buying over-priced storage from the commercial cloud providers.

Tuesday, January 15, 2013

Moving vs. Copying

At the suggestion of my long-time friend Frankie, I've been reading Trillions, a book by Peter Lucas, Joe Ballay and Mickey McManus. They are principals of MAYA Design, a design firm that emerged from the Design and CS schools at Carnegie-Mellon in 1989. Among its founders was Jim Morris, who ran the Andrew Project at C-MU on which I worked from 1983-85. The ideas in the book draw not just from the Andrew Project's vision of a networked campus with a single, uniform file name-space, as partially implemented in the Andrew File System, but also from Mark Weiser's vision of ubiquitous computing at Xerox PARC. Mark's 1991 Scientific American article "The Computer of the 21st Century" introduced the concept to the general public, and although the authors cite it, they seem strangely unaware of work going on at PARC and elsewhere for at least the last 6 years to implement the infrastructure that would make their ideas achievable. Follow me below the fold for the details.

Monday, January 7, 2013

How Much Of The Web Is Archived?

MIT's Technology Review has a nice article about Scott Ainsworth et al's important paper How Much Of The Web Is Archived? (readable summary here). The paper reports an important initial step in measuring the effectiveness of Web archiving, and Scott and his co-authors deserve much credit for it. Below the fold I summarize the paper and raise some caveats as to the interpretation of the results. Tip of the hat to the authors for comments on a draft of this post.

Friday, January 4, 2013

Go Library of Congress!

Carl Franzen pointed me to the report from the Library of Congress on the state of their ingest of the Twitter-stream. Congratulations to the team for two major achievements:
  • getting to the point where they have caught up with ingesting the past, even though some still remains to be processed into its final archival form,
  • and having an automated process in place capable of ingesting the current tweets in near-real-time.
The numbers are impressive:
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Notice the roughly 10-to-1 compression ratio. The two copies of the archive together would be in the region of 1.3PB uncompressed. The average compressed tweet takes up about 130×10^12 / (2×170×10^9) ≈ 380 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The Library is ingesting about 0.5×10^9 tweets/day at 380 bytes/tweet, or 190GB/day, or about 2.2MB/s of bandwidth (ignoring overhead). These numbers will grow as the flow of tweets increases. The data ends up on tape:
Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.
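The back-of-the-envelope arithmetic above is easy to check with a short script (the inputs are the figures from the Library's report):

```python
# Checking the back-of-the-envelope Twitter archive numbers above.
compressed_bytes = 130e12    # ~130TB for two compressed copies
total_tweets = 170e9         # tweets in the archive
copies = 2

bytes_per_tweet = compressed_bytes / (copies * total_tweets)
print(f"~{bytes_per_tweet:.0f} bytes per compressed tweet")

daily_tweets = 0.5e9
daily_bytes = daily_tweets * bytes_per_tweet           # ~190GB/day
bandwidth = daily_bytes / 86400                        # bytes/second
print(f"~{daily_bytes/1e9:.0f} GB/day, ~{bandwidth/1e6:.1f} MB/s")
```

The striking result is the per-tweet size: at roughly 380 bytes compressed, the metadata dwarfs the tweet text itself.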
The scale and growth rate of this collection explain the difficulties the library has in satisfying the 400-odd requests they already have from scholars to access it for research purposes:
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.