Jill Lepore's New Yorker "Cobweb" article has focused attention on the importance of the Internet Archive and on the analogy with the Library of Alexandria, in particular on the risks implicit in the fact that both represent single points of failure, being so much larger than any other collection.
Typically, Jason Scott was first to respond with an outline proposal to back up the Internet Archive by greatly expanding the collaborative efforts of ArchiveTeam. I think Jason is trying to do something really important, and extremely difficult.
The Internet Archive's collection is currently around 15PB. It has doubled in size in about 30 months. Suppose it takes another 30 months to develop and deploy a solution at scale. We're talking about crowd-sourcing a distributed backup of at least 30PB, growing at least 3PB/year.
To get some idea of what this means, suppose we wanted to use Amazon's Glacier. This is, after all, exactly the kind of application Glacier is targeted at. As I predicted shortly after Glacier launched, Amazon has stuck with the 1c/GB/mo price. So in 2017 we'd be paying Amazon $3.6M a year just for the storage costs. Alternatively, suppose we used Backblaze's Storage Pod 4.5 at their current price of about 5c/GB. For each copy we'd have paid $1.5M in hardware cost and would be adding $150K worth per year. This ignores running costs and RAID overhead.
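The arithmetic behind those figures is simple enough to check in a few lines (the size, growth and price inputs are the ones quoted above):

```python
# Back-of-the-envelope check of the storage-cost figures above.
# Inputs from the post: 30 PB by 2017, growing 3 PB/year,
# Glacier at $0.01/GB/month, Backblaze pod storage at ~$0.05/GB.

PB = 1_000_000  # gigabytes per petabyte (decimal, as storage vendors count)

archive_gb = 30 * PB           # projected size when a backup is deployed
growth_gb_per_year = 3 * PB

glacier_per_gb_month = 0.01
glacier_per_year = archive_gb * glacier_per_gb_month * 12
print(f"Glacier storage: ${glacier_per_year / 1e6:.1f}M/year")

pod_per_gb = 0.05
pod_capex = archive_gb * pod_per_gb
pod_growth = growth_gb_per_year * pod_per_gb
print(f"Backblaze pods: ${pod_capex / 1e6:.1f}M up front per copy, "
      f"plus ${pod_growth / 1e3:.0f}K/year of new hardware")
```

Running it reproduces the $3.6M/year, $1.5M and $150K/year numbers, and makes it easy to re-run the sums as the archive grows or prices change.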
It will be very hard to crowd-source resources on this scale, which is why I say this is the opposite of Lots Of Copies Keep Stuff Safe. The system is going to be short of storage; the goal of a backup for the Internet Archive must be the maximum of reliability for the minimum of storage.
Nevertheless, I believe it would be well worth trying some version of his proposal and I'm happy to help any way I can. Below the fold, my comments on the design of such a system.
Tuesday, March 24, 2015
Tuesday, March 17, 2015
More Is Not Better
Hugh Pickens at /. points me to Attention decay in science, providing yet more evidence that the way the journal publishers have abdicated their role as gatekeepers is causing problems for science. The abstract claims:
The exponential growth in the number of scientific papers makes it increasingly difficult for researchers to keep track of all the publications relevant to their work. Consequently, the attention that can be devoted to individual papers, measured by their citation counts, is bound to decay rapidly. ... The decay is ... becoming faster over the years, signaling that nowadays papers are forgotten more quickly. However, when time is counted in terms of the number of published papers, the rate of decay of citations is fairly independent of the period considered. This indicates that the attention of scholars depends on the number of published items, and not on real time.

Below the fold, some thoughts.
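The abstract's claim can be sketched numerically (a toy model of mine, not the paper's, with made-up parameter values): if attention decays at a constant rate per *published paper* and annual output grows exponentially, then the half-life of attention measured in *real time* must shrink over the years.

```python
import math

# Toy model: attention decays by a constant fraction per newly
# published paper; annual output grows exponentially. The number of
# papers needed to halve attention is then constant, but the years
# needed to publish that many papers shrinks.
# All three parameters below are illustrative assumptions.

decay_per_paper = 1e-6          # fraction of attention lost per new paper
papers_per_year_t0 = 100_000    # assumed output in the starting year
growth = 1.05                   # assumed 5% annual growth in output

def half_life_years(start_year: int) -> float:
    papers_to_halve = math.log(2) / decay_per_paper   # constant in "paper time"
    rate = papers_per_year_t0 * growth ** start_year
    # crude approximation: treat output as constant over the interval
    return papers_to_halve / rate

print(half_life_years(0) > half_life_years(20))   # True: faster decay later
```

With these numbers the half-life drops from about 6.9 years to about 2.6 years over two decades, which is the paper's point: constant decay in "paper time" looks like accelerating decay in real time.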
Friday, March 13, 2015
Journals Considered Even More Harmful
Two years ago I posted Journals Considered Harmful, based on this excellent paper which concluded that:
The current empirical literature on the effects of journal rank provides evidence supporting the following four conclusions: 1) Journal rank is a weak to moderate predictor of scientific impact; 2) Journal rank is a moderate to strong predictor of both intentional and unintentional scientific unreliability; 3) Journal rank is expensive, delays science and frustrates researchers; and, 4) Journal rank as established by [Impact Factor] violates even the most basic scientific standards, but predicts subjective judgments of journal quality.
Subsequent events justify skepticism about the net value journals add after allowing for the value they subtract. Now the redoubtable Eric Hellman points out another value-subtracted aspect of the journals with his important post entitled 16 of the top 20 Research Journals Let Ad Networks Spy on Their Readers. Go read it and be appalled.
Update: Eric very wisely writes:
I'm particularly concerned about the medical journals that participate in advertising networks. Imagine that someone is researching clinical trials for a deadly disease. A smart insurance company could target such users with ads that mark them for higher premiums. A pharmaceutical company could use advertising targeting researchers at competing companies to find clues about their research directions. Most journal users (and probably most journal publishers) don't realize how easily online ads can be used to gain intelligence as well as to sell products.

I should have remembered that, less than 3 weeks ago, Brian Merchant at Motherboard posted Looking Up Symptoms Online? These Companies Are Tracking You, pointing out that health sites such as WebMD and, even less forgivably, the Centers for Disease Control, are rife with trackers selling their visitors' information to data brokers:
The CDC example is notable because it’s a government site, one we assume should be free of the profit motive, and entirely safe for use. “It’s basically negligence,”

If you want to look up health information online, you need to use Tor.
Thursday, March 12, 2015
Google's near-line storage offering
Yesterday, Google announced the beta of their Nearline Storage offering. It has the same 1c/GB/mo pricing as Amazon's Glacier, but it has three significant differences:
- It claims to have much lower latency, a few seconds instead of a few hours.
- It has the same (synchronous) API as Google's more expensive storage, where Glacier has a different (asynchronous) API than S3.
- Its pricing for getting data out lacks Glacier's 5% free tier, but otherwise is much simpler than Glacier's.
I believe I know how Google has built their nearline technology; I wrote about it two years ago.
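To illustrate the third difference, here is a sketch of the two retrieval-pricing models as I understand them at the time of writing (both rates are assumptions to check against the current price lists): Glacier lets you retrieve 5% of your stored data free each month, prorated daily, with a notoriously complex peak-rate formula beyond that; Nearline has no free tier but a flat per-GB charge.

```python
# Sketch of the retrieval-cost difference (pricing as I understand it;
# verify against the vendors' current price lists):
# - Glacier: 5% of stored data retrievable free per month, prorated
#   daily; beyond that, a peak-hourly-rate formula too complex to model here.
# - Nearline: no free tier, a flat $0.01/GB retrieval charge.

def glacier_free_daily_gb(stored_gb: float, days_in_month: int = 30) -> float:
    # daily prorated share of the 5%-per-month free allowance
    return stored_gb * 0.05 / days_in_month

def nearline_retrieval_cost(retrieved_gb: float) -> float:
    return retrieved_gb * 0.01

stored = 100_000  # hypothetical 100 TB stored
print(glacier_free_daily_gb(stored))    # ~166.7 GB/day retrievable free
print(nearline_retrieval_cost(1_000))   # flat fee to pull 1 TB back out
```

Staying under Glacier's prorated daily allowance costs nothing but requires careful scheduling; Nearline's flat rate is more predictable but charges for every byte.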
Thursday, March 5, 2015
Archiving Storage Tiers
Tom Coughlin uses Hetzler's touch-rate metric to argue for tiered storage for archives in a two-part series. Although there's good stuff there, I have two problems with Tom's argument. Below the fold, I discuss them.
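For readers unfamiliar with the metric, touch rate (as I understand Hetzler's definition) is roughly the proportion of a repository's content accessed per unit time, bounded above by the tier's aggregate read bandwidth relative to its capacity. A rough sketch, with hypothetical numbers:

```python
# Rough sketch of the touch-rate idea (my reading of Hetzler's metric,
# with made-up example numbers): touch rate is the fraction of stored
# data accessed per year. A storage tier only suits data whose touch
# rate is below what the tier's read bandwidth can sustain.

def touch_rate(accessed_gb_per_year: float, stored_gb: float) -> float:
    return accessed_gb_per_year / stored_gb

def max_touch_rate(read_bandwidth_gb_s: float, stored_gb: float) -> float:
    # upper bound set by the tier's aggregate read bandwidth
    seconds_per_year = 365 * 24 * 3600
    return read_bandwidth_gb_s * seconds_per_year / stored_gb

# Hypothetical archive: 10 PB stored, 100 TB read per year
print(touch_rate(100_000, 10_000_000))     # 1% of the archive touched per year
# A tier reading 1 GB/s could touch the whole archive ~3 times a year at most
print(max_touch_rate(1.0, 10_000_000))
```

An archive with a touch rate of 1%/year fits comfortably on a slow, cheap tier; the argument is about where real archival access patterns fall on that scale.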
Tuesday, March 3, 2015
IDCC15
I wasn't able to attend IDCC2015 two weeks ago in London, but I've been catching up with the presentations on the Web. Below the fold, my thoughts on a few of them.