Tuesday, September 18, 2018

Vint Cerf on Traceability

Vint Cerf's Traceability addresses a significant problem:
how to preserve the freedom and openness of the Internet while protecting against the harmful behaviors that have emerged in this global medium. That this is a significant challenge cannot be overstated. The bad behaviors range from social network bullying and misinformation to email spam, distributed denial of service attacks, direct cyberattacks against infrastructure, malware propagation, identity theft, and a host of other ills.
Cerf's proposed solution is:
differential traceability. The ability to trace bad actors to bring them to justice seems to me an important goal in a civilized society. The tension with privacy protection leads to the idea that only under appropriate conditions can privacy be violated. By way of example, consider license plates on cars. They are usually arbitrary identifiers and special authority is needed to match them with the car owners ... This is an example of differential traceability; the police department has the authority to demand ownership information from the Department of Motor Vehicles that issues the license plates. Ordinary citizens do not have this authority.
Below the fold I examine this proposal and one of the responses.

Thursday, September 13, 2018

Blockchain Solves Preservation!

We're in a period when blockchain or "Distributed Ledger Technology" is the Solution to Everything™, so it is inevitable that it will be proposed as the solution to the problems of digital preservation. John Collomosse et al's abstract for ARCHANGEL: Trusted Archives of Digital Public Documents states:
We present ARCHANGEL; a de-centralised platform for ensuring the long-term integrity of digital documents stored within public archives. Document integrity is fundamental to public trust in archives. Yet currently that trust is built upon institutional reputation --- trust at face value in a centralised authority, like a national government archive or University. ARCHANGEL proposes a shift to a technological underscoring of that trust, using distributed ledger technology (DLT) to cryptographically guarantee the provenance, immutability and so the integrity of archived documents. We describe the ARCHANGEL architecture, and report on a prototype of that architecture build over the Ethereum infrastructure. We report early evaluation and feedback of ARCHANGEL from stakeholders in the research data archives space.
This is a wonderful example of the way people blithely assume that the claimed properties of blockchain systems are actually delivered in the real world. Below the fold I ask whether Collomosse et al have applied appropriate skepticism to blockchain's claims, and whether they have considered the sustainability of their proposal.

Tuesday, September 11, 2018

What Does Data "Durability" Mean

In What Does 11 Nines of Durability Really Mean? David Friend writes:
No amount of nines can prevent data loss.

There is one very important and inconvenient truth about reliability: Two-thirds of all data loss has nothing to do with hardware failure.

The real culprits are a combination of human error, viruses, bugs in application software, and malicious employees or intruders. Almost everyone has accidentally erased or overwritten a file. Even if your cloud storage had one million nines of durability, it can’t protect you from human error.
Friend may be right that these are the top 5 causes of data loss, but over the timescale of preservation as opposed to storage they are far from the only ones. In Requirements for Digital Preservation Systems: A Bottom-Up Approach we listed 13 of them. Below the fold, some discussion of the meaning and usefulness of durability claims.
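As a rough illustration of what a claim of N nines of durability asserts, the sketch below treats it as an independent annual loss probability per object. That independence is an assumption made here for the arithmetic, and it ignores exactly the correlated, non-hardware causes Friend describes. The function names and the collection size are hypothetical.

```python
def annual_loss_probability(nines: int) -> float:
    """Per-object probability of loss in one year implied by N nines of durability."""
    return 10.0 ** -nines


def expected_annual_losses(objects: int, nines: int) -> float:
    """Expected number of objects lost per year, assuming independent losses."""
    return objects * annual_loss_probability(nines)


if __name__ == "__main__":
    objects = 10_000_000  # hypothetical collection size
    losses = expected_annual_losses(objects, 11)
    print(f"expected losses per year: {losses:.1e}")
    print(f"mean years between single-object losses: {1 / losses:,.0f}")
```

The number produced this way describes only the failure modes the model includes, which is why it says little about loss over the timescale of preservation.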

Tuesday, September 4, 2018

Chia Network

Back in March I wrote Proofs of Space, analyzing Bram Cohen's fascinating EE380 talk. I've now learned more about Chia Network, the company that is implementing a network using his methods. Below the fold I look into their prospects.

Thursday, August 30, 2018

What Does The Decentralized Web Need?

In, among others, It Isn't About The Technology, Decentralized Web Summit 2018: Quick Takes and Special Report on Decentralizing the Internet I've been skeptical at considerable length about the prospect of a decentralized Web. I would really like the decentralized Web to succeed, so I admit I'm biased, just pessimistic.

I was asked to summarize what would be needed for success apart from working technology (which we pretty much have). My answer was four things:
  • A sustainable business model
  • Anti-trust enforcement
  • The killer app
  • A way to remove content
Below the fold, I try to explain each of them at more readable length.

Tuesday, August 28, 2018

Lending Emulations?

In my report Emulation and Virtualization as Preservation Strategies I discussed the legal issues around emulating obsolete software, the basis for the burgeoning retro-gaming industry. These issues have attracted attention recently, as Kyle Orland reports:
In the wake of Nintendo's recent lawsuits against other ROM distribution sites, major ROM repository EmuParadise has announced it will preemptively cease providing downloadable versions of copyrighted classic games.
Below the fold, some comments on this threat to our cultural history.

Friday, August 24, 2018

Triumph Of Greed Over Arithmetic

I discussed FileCoin's ICO in The Four Most Expensive Words in the English Language and worked out that:
Filecoin needs to generate $25.7M/yr over and above what it pays the providers. But it can't charge the customers more than S3, or $0.276/GB/yr. If it didn't pay the providers anything it would need to be storing over 93PB right away to generate a 10% return. That's a lot of storage to expect providers to donate to the system.
On my bike ride this morning I thought of another way of looking at FileCoin's optimistic economics.

FileCoin won't be able, as S3 does, to claim 11 nines of durability and triple redundancy across data centers. So the real competition is S3's Reduced Redundancy Storage, which currently costs $23K/PB/month. Assuming that Amazon continues its historic 15%/year Kryder rate, storing a Petabyte in RRS for a decade is $1.48M. So, if you believe cryptocurrency "prices", FileCoin's "investors" pre-paid $257M for data storage at some undefined time in the future. They could instead have, starting now, stored 174PB in S3's RRS for 10 years. So FileCoin needs to store at least 174PB for 10 years before breaking even.

It gets worse. S3 is by no means the low-cost provider in the storage market. If we assume that the competition is Backblaze's B2 service at $0.06/GB/yr and that their Kryder rate is zero, FileCoin would need to store 428PB for 10 years before breaking even. Nearly half an Exabyte for a decade!
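The break-even figures above can be checked with a short sketch. It assumes the Kryder rate is applied as a price drop at each year boundary, with the first year charged at the current price; that convention is an assumption, chosen because it matches the $1.48M figure. The function and variable names are hypothetical.

```python
def decade_cost_per_pb(first_year_cost: float, kryder_rate: float, years: int = 10) -> float:
    """Total cost of storing 1PB for `years`, with the price falling by kryder_rate each year."""
    return sum(first_year_cost * (1.0 - kryder_rate) ** year for year in range(years))


ICO_PROCEEDS = 257e6  # dollars, from the post

# S3 Reduced Redundancy Storage: $23K/PB/month, 15%/year Kryder rate.
rrs_decade = decade_cost_per_pb(23e3 * 12, 0.15)
rrs_break_even_pb = ICO_PROCEEDS / rrs_decade

# Backblaze B2: $0.06/GB/yr is $60K/PB/yr (10**6 GB per PB), Kryder rate zero.
b2_decade = decade_cost_per_pb(0.06 * 1e6, 0.0)
b2_break_even_pb = ICO_PROCEEDS / b2_decade

if __name__ == "__main__":
    print(f"RRS: ${rrs_decade / 1e6:.2f}M per PB-decade, break-even {rrs_break_even_pb:.0f}PB")
    print(f"B2:  ${b2_decade / 1e6:.2f}M per PB-decade, break-even {b2_break_even_pb:.0f}PB")
```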

Tuesday, August 21, 2018

Optical media durability

At last I started clearing out the garage laundry room cupboards, which is where amongst much other stuff the optical media backups I take every week have been accumulating for many years. They have been stored in a fairly warm shirt-sleeve environment with no special precautions. So to get some idea of the durability of writable optical media, I've been somewhat randomly pulling groups of backups out of the stacks and re-verifying the MD5 checksums, which were all verified immediately after writing.

TL;DR: Surprisingly, I'm getting good data from CD-Rs more than 14 years old, and from DVD-Rs nearly 12 years old. Your mileage may vary. Below the fold, my results.
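For readers who want to run a similar check on their own media, here is a minimal sketch of re-verifying recorded MD5 checksums. It is not the procedure used for these results, which the summary does not describe; the manifest format (a mapping from file name to hex digest) and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the names whose digest no longer matches, or which cannot be read."""
    bad = []
    for name, recorded in manifest.items():
        try:
            if md5_of_file(root / name) != recorded.lower():
                bad.append(name)
        except OSError:
            bad.append(name)  # unreadable media counts as a failure
    return bad
```

An empty list from `verify` means every file listed in the manifest still matches the digest recorded when it was written.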

Tuesday, August 14, 2018

The Internet of Torts

Rebecca Crootof at Balkinization has two interesting posts:
  • Introducing the Internet of Torts, in which she describes "how IoT devices empower companies at the expense of consumers and how extant law shields industry from liability."
  • Accountability for the Internet of Torts, in which she discusses "how new products liability law and fiduciary duties could be used to rectify this new power imbalance and ensure that IoT companies are held accountable for the harms they foreseeably cause."
Below the fold, some commentary on both.

Thursday, August 9, 2018

The Blockchain Trilemma

The blockchain trilemma
In The economics of blockchains Markus K Brunnermeier and Joseph Abadi (BA) write:
much of the innovation in blockchain technology has been aimed at wresting power from centralised authorities or monopolies. Unfortunately, the blockchain community’s utopian vision of a decentralised world is not without substantial costs. In recent research, we point out a ‘blockchain trilemma’ – it is impossible for any ledger to fully satisfy the three properties shown in Figure 1 simultaneously (Abadi and Brunnermeier 2018). In particular, decentralisation has three main costs: waste of resources, scalability problems, and network externality inefficiencies.
Below the fold, some commentary.

Tuesday, August 7, 2018

Decentralized Web Summit 2018: Quick Takes

Last week I attended the main two days of the 2018 Decentralized Web Summit put on by the Internet Archive at the San Francisco Mint. I had many good conversations with interesting people, but it didn't change the overall view I've written about in the past. There were a lot of parallel sessions, so I only got a partial view, and the acoustics of the Mint are TERRIBLE for someone my age, so I may have missed parts even of the sessions I was in. Below the fold, some initial reactions.

Thursday, August 2, 2018

Shitcoin And The Lightning Network

The Lightning Network is an overlay on the Bitcoin network, intended to remedy the fact that Bitcoin is unusable for actual transactions. Andreas Brekken, of shitcoin.com, tried installing, running and using a node. He describes his experience in four blog posts:
  1. Can I compile and run a node?
  2. We must first become the Lightning Network
  3. Paying for goods and services
  4. What happens when you close half of the Lightning Network?
Brekken's final TL;DR was “Operating the largest node on the Bitcoin Lightning Network has been educational, frustrating, fun, and at times terrifying. I look forward to trying it again once the technology matures.” Below the fold I look into some of the details.

Tuesday, July 31, 2018

Amazon's Margins Again

AMZN operating margins
For six years I've been pointing out that economies of scale allow for the astonishing margins Amazon enjoys on S3 and the rest of AWS. Now, This is the Amazon everyone should have feared — and it has nothing to do with its retail business by Jason Del Rey and Rani Molla documents AWS' margins in this table.
Amazon’s $52.9 billion of revenue in the second quarter of the year came in a tad below what Wall Street analysts expected — and that doesn’t matter whatsoever.

That’s because the massive online retailer once again posted its largest quarterly profit in history — $2.5 billion for the quarter — on the back of two businesses that were afterthoughts just a few years ago: Amazon Web Services, its cloud computing unit, as well as its fast-growing advertising business.
Below the fold, I discuss one of the implications of these amazing margins.

Tuesday, July 17, 2018


One of the things that I, as an observer of the blockchain scene, find fascinating is how the various heists illuminate the deficiencies of actual, as opposed to the Platonic ideal, blockchain-based systems.

I've been writing for more than 4 years that, at scale, blockchains are DINO (Decentralized In Name Only) because irresistible economies of scale drive centralization. Now, a heist illuminates that, in practice, "smart contracts" such as those on the Ethereum blockchain (which is DINO) are also IINO (Immutable In Name Only). Follow me below the fold for the explanation.

Monday, July 9, 2018

School's out (meta)

Grandkids are sick, or didn't get into the camp their parents wanted, so blogging will be close to non-existent for a while. Sorry about that!

Tuesday, July 3, 2018

Special Report on Decentralizing the Internet (Updated)

The Economist's June 30th issue features a special report from Ludwig Siegele entitled How to fix what has gone wrong with the internet consisting of the following articles:
I really like the way The Economist occasionally allows its writers to address a topic at length. Siegele provides a good overview of what has gone wrong and the competing views of how to fix it. Below the fold, my overall critique, and commentary on some of the articles.

Monday, July 2, 2018

Josh Marshall on Facebook

Last September in Josh Marshall on Google, I wrote:
a quick note to direct you to Josh Marshall's must-read A Serf on Google's Farm. It is a deep dive into the details of the relationship between Talking Points Memo, a fairly successful independent news publisher, and Google. It is essential reading for anyone trying to understand the business of publishing on the Web.
Marshall wasn't happy with TPM's deep relationship with Google. In Has Web Advertising Jumped The Shark? I quoted him:
We could see this coming a few years ago. And we made a decisive and longterm push to restructure our business around subscriptions. So I'm confident we will be fine. But journalism is not fine right now. And journalism is only one industry the platform monopolies affect. Monopolies are bad for all the reasons people used to think they were bad. They raise costs. They stifle innovation. They lower wages. And they have perverse political effects too. Huge and entrenched concentrations of wealth create entrenched and dangerous locuses of political power.
Have things changed? Follow me below the fold.

Friday, June 29, 2018

Cryptocurrencies Have Limits

The Economic Limits Of Bitcoin And The Blockchain by Eric Budish is an important analysis of the economics of two kinds of "51% attack" on Bitcoin and other cryptocurrencies, such as those becoming endemic on Bitcoin Gold and other alt-coins:
  • A "double spend" attack, in which an attacker spends cryptocurrency to obtain goods, then makes the spend disappear in order to spend the cryptocurrency again.
  • A "sabotage" attack, in which short-sellers discredit the cryptocurrency to reduce its value.
Below the fold, some commentary on Budish's paper.

Thursday, June 28, 2018

Rate limits

Andrew Marantz writes in Reddit and the Struggle to Detoxify the Internet:
[On 2017's] April Fools’, instead of a parody announcement, Reddit unveiled a genuine social experiment. It was called r/Place, and it was a blank square, a thousand pixels by a thousand pixels. In the beginning, all million pixels were white. Once the experiment started, anyone could change a single pixel, anywhere on the grid, to one of sixteen colors. The only restriction was speed: the algorithm allowed each redditor to alter just one pixel every five minutes. “That way, no one person can take over—it’s too slow,” Josh Wardle, the Reddit product manager in charge of Place, explained. “In order to do anything at scale, they’re gonna have to coöperate.”
The r/Place experiment successfully forced coöperation, for example with r/AmericanFlagInPlace drawing a Stars and Stripes, or r/BlackVoid trying to rub out everything:
Toward the end, the square was a dense, colorful tapestry, chaotic and strangely captivating. It was a collage of hundreds of incongruous images: logos of colleges, sports teams, bands, and video-game companies; a transcribed monologue from “Star Wars”; likenesses of He-Man, David Bowie, the “Mona Lisa,” and a former Prime Minister of Finland. In the final hours, shortly before the experiment ended and the image was frozen for posterity, BlackVoid launched a surprise attack on the American flag. A dark fissure tore at the bottom of the flag, then overtook the whole thing. For a few minutes, the center was engulfed in darkness. Then a broad coalition rallied to beat back the Void; the stars and stripes regained their form, and, in the end, the flag was still there.
What is important about the r/Place experiment? Follow me below the fold for an explanation.
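The restriction Marantz describes, one pixel per redditor every five minutes, is a per-user rate limit. A minimal sketch of such a limiter follows; it is not Reddit's implementation, and the class and method names are hypothetical.

```python
class PixelRateLimiter:
    """Allow each user one pixel change per cooldown period."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_change: dict[str, float] = {}

    def try_place(self, user: str, now: float) -> bool:
        """Record and allow the change if the user's cooldown has expired."""
        last = self._last_change.get(user)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_change[user] = now
        return True
```

In a running service `now` would come from a monotonic clock such as `time.monotonic()`. Taking it as a parameter keeps the sketch easy to test.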

Thursday, June 21, 2018

Software Heritage Archive Goes Live

June 7th was a big day for software preservation; it was the formal opening of Software Heritage's archive. Congratulations to Roberto di Cosmo and the team! There's a post on the Software Heritage blog with an overview:
Today, June 7th 2018, we are proud to be back at Unesco headquarters to unveil a major milestone in our roadmap: the grand opening of the doors of the Software Heritage archive to the public (the slides of the presentation are online). You can now look at what we archived, exploring the largest collection of software source code in the world: you can explore the archive right away, via your web browser. If you want to know more, an upcoming post will guide you through all the features that are provided and the internals backing them.
Morane Gruenpeter's Software Preservation: A Stepping Stone for Software Citation is an excellent explanation of the role that Software Heritage's archive plays in enabling researchers to cite software:
In recent years software has become a legitimate product of research gaining more attention from the scholarly ecosystem than ever before, and researchers feel increasingly the need to cite the software they use or produce. Unfortunately, there is no well established best practice for doing this, and in the citations one sees used quite often ephemeral URLs or other identifiers that offer little or no guarantee that the cited software can be found later on.

But for software to be findable, it must have been preserved in the first place: hence software preservation is actually a prerequisite of software citation.
The importance of preserving software, and in particular open source software, is something I've been writing about for nearly a decade. My initial post about the Software Heritage Foundation started:
Back in 2009 I wrote:
who is to say that the corpus of open source is a less important cultural and historical artifact than, say, romance novels.
Back in 2013 I wrote:
Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.
Please support this important work by donating to the Software Heritage Foundation.

Tuesday, June 19, 2018

The Four Most Expensive Words in the English Language

There are currently a number of attempts to deploy a cryptocurrency-based decentralized storage network, including MaidSafe, FileCoin, Sia and others. Distributed storage networks have a long history, and decentralized, peer-to-peer storage networks a somewhat shorter one. None have succeeded; Amazon's S3 and all other successful network storage systems are centralized.

Despite this history, initial coin offerings for these nascent systems have raised incredible amounts of "money", if you believe the heavily manipulated "markets". According to Sir John Templeton the four words are "this time is different". Below the fold I summarize the history, then ask what is different this time, and how expensive is it likely to be?

Tuesday, June 12, 2018

No-one could have predicted ...

... the threats posed by information technology to civil liberties. But my friend Robert G. Kennedy III came close. In April 1989 he wrote Technological Threats To Civil Liberties. Read almost 30 years later, it is an amazingly perceptive piece. Here are two samples to encourage you to read the whole thing:
An alarming synergy could occur when debit card data is accessed by connectionist machines (neural networks) for business applications. There are patterns to our behavior (economic and otherwise) of which we ourselves might be unaware; these can be extracted by neural nets without the need for formal rules, models, or a priori knowledge. A net is very, very good at pattern inference and recognition. ... One can see the potential for some truly subtle forms of embezzlement, irresistable invasive advertising keyed to surreptitiously acquired psychological profiles, or consumer fraud on a grand scale, among other things.
An executive I know has told me of an office surveillance/attendance system being installed at his company, along the same lines as home security systems. Commercial versions have been on the market for over a year. It uses interactive badges and scanners, sort of transponders-in-an-ID, to track the location, time, and identity of personnel in a building: sort of an electronic leash. (He confided that it is silly to treat employees as bar-coded merchandise; for my part, I was polite enough not to mention the phrase, "Big Brother".)
As you read, remember that it was written two-and-a-half years before the first US Web page went up (which was around 6th Dec. 1991).

Thursday, June 7, 2018

The Island of Misfit Toys

The Berkman Center's Jonathan Zittrain has a New York Times editorial entitled From Westworld to Best World for the Internet of Things, which starts:
Last month the F.B.I. issued an urgent warning: Everyone with home internet routers should reboot them to shed them of malware from “foreign cyberactors.”
Below the fold, some details and a critique of Zittrain’s proposals for improving the IoT.

Tuesday, June 5, 2018

Cryptographers on Blockchains: Part 2 (updated)

Back in April I wrote Cryptographers on Blockchains; they weren't enthusiastic. It is time for some more of the same, so follow me below the fold.

Thursday, May 31, 2018

Recreational Bugs

At the San Diego Usenix in January 1989 I presented Visualizing X11 Clients, a paper written by David Lemke and myself. In email conversation about his Pie Menus: A 30 Year Retrospective, Don Hopkins unearthed the script for the talk I gave, which I posted to the "xpert@athena.mit.edu" mail list. To record the script for posterity, a slightly edited version is below the fold.

Don also unearthed A Window Manager for Bitmapped Displays and Unix, the paper James Gosling and I wrote describing the Andrew window manager for the Alvey Workshop at Cosener's House, Abingdon (29th April to 1st May 1985) (DOI). The entire workshop proceedings were subsequently published as Methodology of Window Management, and are online here. The Andrew window manager tiled the screen with windows because, as the quote at the head of the paper said:
You will get a better Gorilla effect if you use as big a piece of paper as possible. Kunihiko Kasahara, Creative Origami.
In retrospect, this wasn't a great idea.

Tuesday, May 29, 2018

Pie Menus

Don's NeWS Pie Menu
IIRC it is 1988, and James Gosling and I are in the Sun Microsystems booth at SIGGRAPH demo-ing the NeWS window system. Don Hopkins walks up with a tape cartridge in his hand and says "load this". Knowing Don, we do, and all of a sudden all the menus in the system are transformed from the conventional pull-right rectangles to circles divided into pie-slices. And Don, at that time the most caffeinated person I'd ever met, is blazing through the menus faster than we've ever seen before.

Why am I writing this thirty years later? Follow me below the fold.

Thursday, May 24, 2018

How Far Is Far Enough?

When collecting an individual web site for preservation by crawling, it is necessary to decide where its edges are: which links encountered are "part of the site" and which are links off-site. The crawlers use "crawl rules" to make these decisions. A simple rule would say:
Collect all URLs starting https://www.nytimes.com/
NoScript on http://nytimes.com
If a complex "site" is to be properly preserved the rules need to be a lot more complex. The image shows the start of the list of DNS names from which the New York Times home page embeds resources. Preserving this single page, let alone the "whole site", would need resources from at least 17 DNS names. Rules are needed for each of these names. How are all these more complex rules generated? Follow me below the fold for the answer, and news of an encouraging recent development.
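To make the difference concrete, here is a minimal sketch of the simple prefix rule alongside a per-DNS-name allow-list. It is not the rule syntax of any particular crawler; the function names and the example host `static.example.net` are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope_by_prefix(url: str, prefix: str = "https://www.nytimes.com/") -> bool:
    """The simple rule from the text: collect URLs starting with the prefix."""
    return url.startswith(prefix)


def in_scope_by_host(url: str, allowed_hosts: set[str]) -> bool:
    """A per-DNS-name rule: collect URLs whose host is in an allow-list."""
    host = urlsplit(url).hostname
    return host is not None and host in allowed_hosts


if __name__ == "__main__":
    hosts = {"www.nytimes.com", "static.example.net"}  # hypothetical allow-list
    embedded = "https://static.example.net/app.js"
    print(in_scope_by_prefix(embedded), in_scope_by_host(embedded, hosts))
```

The prefix rule rejects the embedded resource, so a page collected with it alone would be incomplete. The allow-list accepts it, at the cost of needing an entry for every name.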

Tuesday, May 22, 2018

ASICs and Mining Centralization

Three and a half years ago, as part of my explanation of why peer-to-peer networks that were successful would become centralized, I wrote in Economies of Scale in Peer-to-Peer Networks:
When new, more efficient technology is introduced, thus reducing the cost per unit contribution to a P2P network, it does not become instantly available to all participants. As manufacturing ramps up, the limited supply preferentially goes to the manufacturer's best customers, who would be the largest contributors to the P2P network. By the time supply has increased so that smaller contributors can enjoy the lower cost per unit contribution, the most valuable part of the technology's useful life is over.
I'm not a blockchain insider. But now, in a blockbuster post, a real insider, David Vorick, the lead developer of Sia, a blockchain-based cloud storage platform, makes it clear that the effect I described has been dominating Bitcoin and other blockchains for a long time, and that it has led to centralization in the market for mining hardware:
The biggest takeaway from all of this is that mining is for big players. The more money you spend, the more of an advantage you have, and there’s not an easy way to change that equation. At least with traditional Nakamoto style consensus, a large entity that produces and controls most of the hashrate seems to be more or less the outcome, and at the very best you get into a situation where there are 2 or 3 major players that are all on similar footing. But I don’t think at any point in the next few decades will we see a situation where many manufacturing companies are all producing relatively competitive miners. Manufacturing just inherently leads to centralization, and it happens across many different vectors.
Below the fold, the details.

Wednesday, May 16, 2018

Shorter talk at MSST2018

I was invited to give both a longer and a shorter talk at the 34th International Conference on Massive Storage Systems and Technology at Santa Clara University. Below the fold is the text, with links to the sources, of the shorter talk, which was updated from, and entitled, DNA's Niche in the Storage Market.

Longer talk at MSST2018

I was invited to give both a longer and a shorter talk at the 34th International Conference on Massive Storage Systems and Technology at Santa Clara University. Below the fold is the text, with links to the sources, of the longer talk, which was updated from, and entitled, The Medium-Term Prospects for Long-Term Storage Systems.

Monday, May 14, 2018

Blockchain for Peer Review

An initiative has started in the UK called Blockchain for Peer Review. It claims:
The project will develop a protocol where information about peer review activities (submitted by publishers) are stored on a blockchain. This will allow the review process to be independently validated, and data to be fed to relevant vehicles to ensure recognition and validation for reviewers. By sharing peer review information, while adhering to laws on privacy, data protection and confidentiality, we will foster innovation and increase interoperability.
Everything about this makes sense and could be implemented with a database run by a trusted party, as for example CrossRef does for DOI resolution. Implementing it with a blockchain is effectively impossible. Follow me below the fold for the explanation.

Tuesday, May 8, 2018

Prof. James Morris: "One Last Lecture"

The most important opportunity in my career was when Prof. Bob Sproull, then at Xerox PARC, suggested that I should join the Andrew Project (paper) then just starting at Carnegie-Mellon and run by Prof. James (Jim) Morris. The two years I spent working with Jim and the incredibly talented team he assembled (James Gosling, Mahadev Satyanarayanan, Nathaniel Borenstein, ...) changed my life.

Jim's final lecture at CMU is full of his trademark insights and humor, covering the five mostly CMU computing pioneers who influenced his career. You should watch the whole hour-long video, but below the fold I have transcribed a few tastes:

Monday, May 7, 2018

Might Need Some Work

"I Agree" - Source
Cory Doctorow writes:
"I Agree" is Dima Yarovinsky's art installation for Visualizing Knowledge 2018, with printouts of the terms of service for common apps on scrolls of colored paper, creating a bar chart of the fine print that neither you, nor anyone else in the history of the world, has ever read.
Earlier, Doctorow explained that the GDPR requires that:
Under the new directive, every time a European's personal data is captured or shared, they have to give meaningful consent, after being informed about the purpose of the use with enough clarity that they can predict what will happen to it.

Wednesday, May 2, 2018

"Privacy Is No Longer A Social Norm"

It is widely believed that in 2010 Mark Zuckerberg said "Privacy is no longer a social norm" but apparently that wasn't exactly what he said. Below the fold, I take off from this and other misquotes to look at our home-town's major industry, surveillance. Facebook (now headquartered in Menlo Park) has been getting all the attention recently, but they probably know less about you than Palantir Technologies, still headquartered in Palo Alto.

Monday, April 30, 2018

Michael Nelson's Fifteen Minutes Of Fame

We interrupt our regularly scheduled blogging for this special announcement. Go read Michael Nelson's post Why we need multiple web archives: the case of blog.reidreport.com right now! It's a detailed account, in several updates, of the forensic analysis of Joy-Ann Reid's claim that either her blog or the Internet Archive was hacked. Michael's work landed him a spot on CNN at 0930 April 29th. He did an excellent job of explanation. Half an hour later Reid walked back her claims.

Michael is right about the importance of multiple independent Web archives; once again the Lots Of Copies Keep Stuff Safe principle. But the economics of this multiplicity are problematic.

Thursday, April 26, 2018

Cryptographers On Blockchains

David Gerard's April 21st blog post is a real linkfest. Below the fold, commentary on four of the links.

Tuesday, April 24, 2018

All Your Tweets Are Belong To Kannada

Gerd Badur  CC BY-SA 3.0, Source
Sawood Alam and Plinio Vargas have a fascinating blog post documenting their investigation into why:
47% of mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case though, it is disconcerting and counter-intuitive.
Kannada is an Indian language spoken by only about 38 million people. Below the fold, some commentary.

Thursday, April 12, 2018

Your Tax Dollars At Work

When I was writing Pre-publication Peer Review Subtracts Value, Springer wanted to charge me $39.95 for access to Comparing Published Scientific Journal Articles to Their Pre-print Versions by Martin Klein et al. This despite the fact that the copyright notice said:
This is a U.S. government work and its text is not subject to copyright protection in the United States
Fortunately, you can now follow the link to the final version at arXiv.org. I'm not the only one annoyed by the publishers charging for access to papers not subject to copyright. Below the fold, some more on this scam.

Tuesday, April 10, 2018

Natural Redundancy

Most uncompressed files contain significant redundancy, which is why they can be made smaller by a compression algorithm; such algorithms work by reducing redundancy. The better the algorithm, the less redundancy left in the output. If the files are then stored for the long term, they need to be protected, for example by erasure coding, which adds some redundancy back. In Exploiting Source Redundancy to Improve the Rate of Polar Codes, Ying Wang, Krishna R. Narayanan and Anxiao (Andrew) Jiang of Texas A&M explore using the original redundancy to reduce the amount of protection redundancy needed for a given level of reliability. Below the fold, some commentary.

Monday, April 9, 2018

John Perry Barlow RIP

By Mohamed Nanabhay from Qatar, CC BY 2.0
Vicky Reich and I were both acquainted with John Perry Barlow in the 90s; we met at one of the parties he threw at the DNA Lounge. He was perhaps the most charismatic person I've ever encountered. So we were anxious to attend the symposium the EFF and the Internet Archive organized last Saturday to honor one aspect of his life, his writing and activism around civil liberties in cyberspace.

The Economist, The Guardian and the New York Times had good obituaries, but they mentioned only his Declaration of the Independence of Cyberspace among his writings. It was undoubtedly an important rallying-cry at the time, but it should not be allowed to overshadow his other cyberspace-related writings, thankfully collected by the EFF in the John Perry Barlow Library. Below the fold, the one I would have chosen.

Thursday, April 5, 2018

Emulating Stephen Hawking's Voice

Jason Fagone at the San Francisco Chronicle has a fascinating story of heroic, successful (and timely) emulation in The Silicon Valley quest to preserve Stephen Hawking’s voice. It's the story of a small team which started work in 2009 trying to replace Hawking's voice synthesizer with more modern technology. Below the fold, some details to get you to read the whole article.

Tuesday, April 3, 2018

Falling Research Productivity

Are Ideas Getting Harder to Find? by Nicholas Bloom et al looks at the history of investment in R&D and its effect on the product across several industries. Their main example is Moore's Law, and they show that [page 19]:
research effort has risen by a factor of 18 since 1971. This increase occurs while the growth rate of chip density is more or less stable: the constant exponential growth implied by Moore’s Law has been achieved only by a massive increase in the amount of resources devoted to pushing the frontier forward.

Assuming a constant growth rate for Moore’s Law, the implication is that research productivity has fallen by this same factor of 18, an average rate of 6.8 percent per year.

If the null hypothesis of constant research productivity were correct, the growth rate underlying Moore’s Law should have increased by a factor of 18 as well. Instead, it was remarkably stable. Put differently, because of declining research productivity, it is around 18 times harder today to generate the exponential growth behind Moore’s Law than it was in 1971.
Below the fold, some commentary on this and other relevant research.
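The arithmetic linking the quoted factor of 18 to the quoted 6.8 percent per year can be checked with a short Python sketch. The compounding convention is an assumption here, since the excerpt does not state which one the authors use, so both continuous and annual compounding are shown. The function names are hypothetical.

```python
import math

FACTOR = 18.0        # rise in research effort quoted from the paper
ANNUAL_RATE = 0.068  # quoted average annual decline in research productivity


def years_continuous(factor: float, rate: float) -> float:
    """Years for a decline by `factor` at a continuously compounded rate."""
    return math.log(factor) / rate


def years_discrete(factor: float, rate: float) -> float:
    """Years for a decline by `factor` when productivity shrinks by `rate` each year."""
    return math.log(factor) / -math.log(1.0 - rate)


def average_rate(factor: float, years: float) -> float:
    """Continuously compounded average rate implied by `factor` over `years`."""
    return math.log(factor) / years


print(f"continuous compounding: {years_continuous(FACTOR, ANNUAL_RATE):.1f} years")
print(f"annual compounding:     {years_discrete(FACTOR, ANNUAL_RATE):.1f} years")
```

Under either convention the implied span is a little over four decades, which is consistent with a period starting in 1971.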

Thursday, March 29, 2018

Flash vs. Disk (Again)

Gartner's graph
Chris Mellor's NAND chips are going to stay too pricey for flash to slit disk's throat... is based on analysis from Gartner. It continues the theme that I've been stressing for quite some time: flash will not displace hard disk from the bulk storage layer of the hierarchy in the medium term. Follow me below the fold for some commentary, and more graphs.

Wednesday, March 28, 2018

Bitcoin: The Future World Currency?

I had a lot of fun applying arithmetic to DNA's prospects as a storage medium. Jamie Powell must have had just as much fun applying arithmetic to the prospect of Bitcoin becoming the world's currency in Sorry Jack, Bitcoin will not become the global currency, which is part of FT Alphaville's excellent new Someone is wrong on the Internet series. Below the fold, some of the entertainment.

Tuesday, March 27, 2018

Bad Blockchain Content

A Quantitative Analysis of the Impact of Arbitrary Blockchain Content on Bitcoin by Roman Matzutt et al examines the stuff in the Bitcoin blockchain that isn't a monetary transaction. They:
provide the first systematic analysis of the benefits and threats of arbitrary blockchain content. Our analysis shows that certain content, e.g., illegal pornography, can render the mere possession of a blockchain illegal. Based on these insights, we conduct a thorough quantitative and qualitative analysis of unintended content on Bitcoin's blockchain. Although most data originates from benign extensions to Bitcoin's protocol, our analysis reveals more than 1600 files on the blockchain, over 99% of which are texts or images.
Below the fold, some details.

Thursday, March 22, 2018

Proofs of Space

Bram Cohen, the creator of BitTorrent, gave an EE380 talk entitled Stopping grinding attacks in proofs of space. Two aspects were really interesting:
  • A detailed critique of both the Proof of Work system used by most cryptocurrencies and blockchains, and schemes such as Proof of Stake that have been proposed to replace it.
  • An alternate scheme for securing blockchains based on combining Proof of Space with Verifiable Delay Functions.
But there was another aspect that concerned me. Follow me below the fold for details.

Tuesday, March 20, 2018

Pre-publication Peer Review Subtracts Value

Pre-publication peer review is intended to perform two functions: to prevent bad science being published (gatekeeping), and to improve the science that is published (enhancement). Over the years I've written quite often about how the system is no longer "fit for purpose". It's time for another episode, to draw attention to two not-so-recent contributions:
Below the fold, the details.

Thursday, March 15, 2018

Ethics and Archiving the Web

I wanted to draw attention to what looks like a very interesting conference, Rhizome's National Forum on Ethics and Archiving the Web, March 22-24 at the New Museum in New York:
The dramatic rise in the public’s use of the web and social media to document events presents tremendous opportunities to transform the practice of social memory.

Web archives can serve as witness to crimes, corruption, and abuse; they are powerful advocacy tools; they support community memory around moments of political change, cultural expression, or tragedy. At the same time, they can cause harm and facilitate surveillance and oppression.

As new kinds of archives emerge, there is a pressing need for dialogue about the ethical risks and opportunities that they present to both those documenting and those documented. This conversation becomes particularly important as new tools, such as Rhizome’s Webrecorder software, are developed to meet the changing needs of the web archiving field.

Tuesday, March 13, 2018

The "Grand Challenges" of Curation and Preservation

I'm preparing for a meeting next week at the MIT Library on the "Grand Challenges" of digital curation and preservation. MIT, and in particular their library and press, have a commendable tradition of openness, so I've decided to post my input rather than submit it privately. My version of the challenges is below the fold.

Tuesday, March 6, 2018

Techno-hype part 2.5

Last November I wrote Techno-hype part 2 on cryptocurrencies and blockchains, reviewing David Gerard's excellent book Attack of the 50 Foot Blockchain: Bitcoin, Blockchain, Ethereum & Smart Contracts. A lot has happened since, so it's time for an update. Below the fold, I look at three examples of how far these technologies are from being "ready for prime time":
  • The Lightning Network, which is supposed to allow Bitcoin to scale to billions of transactions.
  • IOTA, which is supposed to be a blockchain capable of supporting the Internet of Things.
  • Ethereum, which is supposed to be the infrastructure for "smart contracts".

Thursday, March 1, 2018

Archival Media: Not a Good Business

Thinking more about DNA's Niche in the Storage Market led me to focus on some problems with the market for archival media in general, not just DNA. The details are below the fold.

Tuesday, February 27, 2018

"Nobody cared about security"

There's a common meme that ascribes the parlous state of security on the Internet to the fact that in the ARPAnet days "nobody cared about security". It is true that in the early days of the ARPAnet security wasn't an important issue; everybody involved knew everybody else face-to-face. But it isn't true that the decisions taken in those early days hampered the deployment of security as the Internet took the shape we know today in the late 80s and early 90s. In fact the design decisions taken in the ARPAnet days made the deployment of security easier. The main reason for today's security nightmares is quite different.

I know because I was there, and to a small extent involved. Follow me below the fold for the explanation.

Thursday, February 22, 2018

Brief Talk at Video Game Preservation Workshop

I was asked to give a brief talk to the Video Game Preservation Workshop: Setting the Stage for Multi-Partner Projects at the Stanford Library, discussing the technical and legal aspects of cooperation on preserving software via emulation. Below the fold is an edited text of the talk with links to the sources.

Tuesday, February 20, 2018

Notes from FAST18

I attended the technical sessions of Usenix's File and Storage Technologies (FAST) conference this week. Below the fold, notes on the papers that caught my attention.

Thursday, February 15, 2018

Do You Need A Blockchain?

David Gerard's Do you need a Blockchain? Probably less than Wüst and Gervais think you do reviews an interesting paper, Do you need a Blockchain? by Karl Wüst and Arthur Gervais of ETH Zurich. Their abstract says:
In this article we critically analyze whether a blockchain is indeed the appropriate technical solution for a particular application scenario. We differentiate between permissionless (e.g., Bitcoin/Ethereum) and permissioned (e.g. Hyperledger/Corda) blockchains and contrast their properties to those of a centrally managed database.
Gerard is, for him, pretty enthusiastic about the paper:
This paper is worth your time. They explain the jargon at length, and discuss many commonly-advocated blockchain use cases — it’s a useful survey of the area — even as the authors are huge Bitcoin and blockchain advocates, and somewhat more optimistic for applying blockchains than is really warranted.
Below the fold, I look at both the paper and Gerard's review.

Wednesday, February 14, 2018

Tuesday, February 13, 2018

Correlated Cryptojacking

On February 11 at least 4,275 Web sites were found to have been simultaneously cryptojacked:
they include The City University of New York (cuny.edu), Uncle Sam's court information portal (uscourts.gov), Lund University (lu.se), the UK's Student Loans Company (slc.co.uk), privacy watchdog The Information Commissioner's Office (ico.org.uk) and the Financial Ombudsman Service (financial-ombudsman.org.uk), plus a shedload of other .gov.uk and .gov.au sites, UK NHS services, and other organizations across the globe.

Manchester.gov.uk, NHSinform.scot, agriculture.gov.ie, Croydon.gov.uk, ouh.nhs.uk, legislation.qld.gov.au, the list goes on.
They were all running Coinhive's Monero miner in visitors' browsers. How and why did this happen and what should these sites have been doing to prevent it? Follow me below the fold.

Monday, February 12, 2018

Lessons From Arquivo.pt

Daniel Gomes' video
I'd like to draw your attention to Daniel Gomes' excellent video entitled Improving the robustness of the Arquivo.pt web archive.

Arquivo.pt is the Portuguese Web Archive. It got started in 2007, and in 2010 was among the early archives to support full-text search. In 2013 it suffered a hardware malfunction that took the service down and lost 17% of its content. This led to a complete re-think of the system architecture, implementation, and operations. Daniel describes this process and the encouraging results in detail. It is well worth the 20 minutes to watch it.

Daniel divides the re-think into 5 major sections:
  1. Hardware and software architecture shifted to shared-nothing
  2. Reinforced replication policies
  3. Monitor the service
  4. Quality assurance for software development
  5. Document and test procedures
I'd agree with all these points. Many of the details correspond to things the LOCKSS Program focused on during preparation for the TRAC audit of the CLOCKSS Archive in 2014. This is especially the case for the last of Daniel's sections; the audit forced us to document our processes, which forced us to think about whether they were actually achieving their goals, which led to the discovery that in a number of cases they weren't.

Thursday, February 8, 2018

Meta: Blog Switched To HTTPS (Updated)

Because From July, Chrome will name and shame insecure HTTP websites, I followed the instructions Hamad Ansari provides in Blogger Released Free SSL (HTTPS) For Custom Domains and enabled both "connections over HTTPS" and "HTTPS redirect", so that:
gets redirected to:
Everything I've tried so far works. Please comment on this post if you find things that don't work.

Update: Scott Helme points out that I'm just part of an encouraging trend. The graph shows the top million sites from Alexa in groups of 4,000. For each group, it shows the number of sites that are HTTPS (only, I believe). It shows that the pace of sites going HTTPS-only is increasing. The effect of Chrome's naming and shaming will presumably increase the rate of adoption further in July.

Tuesday, February 6, 2018

DNA's Niche in the Storage Market

I've been writing about storing data in DNA for the last five years, both enthusiastically about DNA's long-term prospects as a technology for storage, and pessimistically about its medium-term prospects. This time, I'd like to look at DNA storage systems as a product, and ask where their attributes might provide a fit in the storage marketplace.

As far as I know no-one has ever built a storage system using DNA as a medium, let alone sold one. Indeed, the only work I know on what such a system would actually look like is by the team from Microsoft Research and the University of Washington. Everything below the fold is somewhat informed speculation. If I've got something wrong, I hope the experts will correct me.

Thursday, January 25, 2018

Magical Thinking At The New York Times

Steven Johnson's Beyond The Bitcoin Bubble in the New York Times Magazine is a 9000-word explanation of how the blockchain can decentralize the Internet that appeared 5 days after my It Isn't About The Technology. Which is a good thing, because otherwise my post would have had to be much longer to address his tome. Follow me below the fold for the part I would have had to add to it.

Tuesday, January 23, 2018

Herbert Van de Sompel's Paul Evan Peters Award Lecture

In It Isn't About The Technology, I wrote about my friend Herbert Van de Sompel's richly-deserved Paul Evan Peters award lecture entitled Scholarly Communication: Deconstruct and Decentralize?, but only in the context of the push to "decentralize the Web". I believe Herbert's goal for this lecture was to spark discussion. In that spirit, below the fold, I have some questions about Herbert's vision of a future decentralized system for scholarly communications built on existing Web protocols. They aren't about the technology but about how it would actually operate.

Thursday, January 18, 2018

Tuesday, January 16, 2018

Not Really Decentralized After All

Here are two more examples of the phenomenon that I've been writing about ever since Economies of Scale in Peer-to-Peer Networks more than three years ago, namely centralized systems built on decentralized infrastructure in ways that nullify the advantages of decentralization:

Monday, January 15, 2018

The Internet Society Takes On Digital Preservation

Another worthwhile initiative comes from The Internet Society, through its New York chapter. They are starting an effort to draw attention to the issues around digital preservation. Shuli Hallack has an introductory blog post entitled Preserving Our Future, One Bit at a Time. They kicked off with a meeting at Google's DC office labeled as being about "The Policy Perspective". It was keynoted by Vint Cerf with respondents Kate Zwaard and Michelle Wu. I watched the livestream. Overall, I thought that the speakers did a good job despite wandering a long way from policies, mostly in response to audience questions.

Vint will also keynote the next event, at Google's NYC office February 5th, 2018, 5:30PM – 7:30PM. It is labeled as being about "Business Models and Financial Motives" and, if that's what it ends up being about, it should be very interesting and potentially useful. I hope to catch the livestream.

Thursday, January 11, 2018

It Isn't About The Technology

A year and a half ago I attended Brewster Kahle's Decentralized Web Summit and wrote:
I am working on a post about my reactions to the first two days (I couldn't attend the third) but it requires a good deal of thought, so it'll take a while.
As I recall, I came away from the Summit frustrated. I posted the TL;DR version of the reason half a year ago in Why Is The Web "Centralized"?:
What is the centralization that decentralized Web advocates are reacting against? Clearly, it is the domination of the Web by the FANG (Facebook, Amazon, Netflix, Google) and a few other large companies such as the cable oligopoly.

These companies came to dominate the Web for economic not technological reasons.
Yet the decentralized Web advocates persist in believing that the answer is new technologies, which suffer from the same economic problems as the existing decentralized technologies underlying the "centralized" Web we have. A decentralized technology infrastructure is necessary for a decentralized Web but it isn't sufficient. Absent an understanding of how the rest of the solution is going to work, designing the infrastructure is an academic exercise.

It is finally time for the long-delayed long-form post. I should first reiterate that I'm greatly in favor of the idea of a decentralized Web based on decentralized storage. It would be a much better world if it happened. I'm happy to dream along with my friend Herbert Van de Sompel's richly-deserved Paul Evan Peters award lecture entitled Scholarly Communication: Deconstruct and Decentralize?. He describes a potential future decentralized system of scholarly communication built on existing Web protocols. But even he prefaces the dream with a caveat that the future he describes "will most likely never exist".

I agree with Herbert about the desirability of his vision, but I also agree that it is unlikely. Below the fold I summarize Herbert's vision, then go through a long explanation of why I think he's right about the low likelihood of its coming into existence.

Monday, January 8, 2018

The $2B Joke

Everything you need to know about cryptocurrency is in Timothy B. Lee's
Remember Dogecoin? The joke currency soared to $2 billion this weekend:
"Nobody was supposed to take Dogecoin seriously. Back in 2013, a couple of guys created a new cryptocurrency inspired by the "doge" meme, which features a Shiba Inu dog making excited but ungrammatical declarations. ... At the start of 2017, the value of all Dogecoins in circulation was around $20 million. ... Then on Saturday the value hit $2 billion. ... "It says a lot about the state of the cryptocurrency space in general that a currency with a dog on it which hasn't released a software update in over 2 years has a $1B+ market cap," [cofounder] Palmer told Coindesk last week.
So blockchain, such bubble. Up 100x in a year. Are you HODL-ing or getting your money out?

Digital Preservation Declaration of Shared Values

I'd like to draw your attention to the effort underway by a number of organizations active in digital preservation to agree on a Digital Preservation Declaration of Shared Values:
The digital preservation landscape is one of a multitude of choices that vary widely in terms of purpose, scale, cost, and complexity. Over the past year a group of collaborating organizations united in the commitment to digital preservation have come together to explore how we can better communicate with each other and assist members of the wider community as they negotiate this complicated landscape.

As an initial effort, the group drafted a Digital Preservation Declaration of Shared Values that is now being released for community comment. The document is available here: https://docs.google.com/document/d/1cL-g_X42J4p7d8H7O9YiuDD4-KCnRUllTC2s...

The comment period will be open until March 1st, 2018. In addition, we welcome suggestions from the community for next steps that would be beneficial as we work together.
The list of shared values (Collaboration, Affordability, Availability, Inclusiveness, Diversity, Portability/Interoperability, Transparency/information sharing, Accountability, Stewardship Continuity, Advocacy, Empowerment) includes several to which adherence in the past hasn't been great.

There are already good comments on the draft. Having more input, and input from a broader range of institutions, would help this potentially important initiative.

Friday, January 5, 2018

Meltdown & Spectre

This hasn't been a good few months for Intel. I wrote in November about the vulnerabilities in their Management Engine. Now they, and other CPU manufacturers, are facing Meltdown and Spectre, three major vulnerabilities caused by side-effects of speculative execution. The release of these vulnerabilities was rushed and the initial reaction less than adequate.

The three vulnerabilities are very serious, but mitigations are in place and appear to be less costly than reports focused on the worst case would lead you to believe. Below the fold, I look at the reaction, explain what speculative execution means, and point to the best explanation I've found of where the vulnerabilities come from and what the mitigations do.

Tuesday, January 2, 2018

The Box Conspiracy

Growing up in London left me with a life-long interest in the theatre (note the spelling). Although I greatly appreciate polished productions of classics, such as the Royal National Theatre's 2014 King Lear, my particular interests are:
I've been writing recently about Web advertising, reading Tim Wu's book The Attention Merchants: The Epic Scramble to Get Inside Our Heads, and especially watching Dude, You Broke The Future, Charlie Stross' keynote for the 34th Chaos Communications Congress. As I do so, I can't help remembering a show I saw nearly a quarter of a century ago that fit the last of those categories. Below the fold I pay tribute to the prophetic vision of an under-appreciated show and its author.