Tuesday, July 7, 2015

IIPC Preservation Working Group

The Internet Archive has by far the largest archive of Web content but its preservation leaves much to be desired. The collection is mirrored between San Francisco and Richmond in the Bay Area, both uncomfortably close to the same major fault systems. There are partial copies in the Netherlands and Egypt, but they are not synchronized with the primary systems.

Now, Andrea Goethals and her co-authors from the IIPC Preservation Working Group have a paper entitled Facing the Challenge of Web Archives Preservation Collaboratively that reports on a survey of Web archives' preservation activities in the following areas: Policy, Access, Preservation Strategy, Ingest, File Formats and Integrity. They conclude:
This survey also shows that long term preservation planning and strategies are still lacking to ensure the long term preservation of web archives. Several reasons may explain this situation: on one hand, web archiving is a relatively recent field for libraries and other heritage institutions, compared for example with digitization; on the other hand, web archives preservation presents specific challenges that are hard to meet.
I discussed the problem of creating and maintaining a remote backup of the Internet Archive's collection in The Opposite of LOCKSS. The Internet Archive isn't alone in having less than ideal preservation of its collection. It's clear the major challenges are the storage and bandwidth requirements for Web archiving, and their rapid growth. Given the limited resources available, and the inadequate reliability of current storage technology, prioritizing collecting more content over preserving the content already collected is appropriate.

Tuesday, June 30, 2015

Blaming the Victim

The Washington Post is running a series called Net of Insecurity. So far it includes:
  • A Flaw In The Design, discussing the early history of the Internet and how the difficulty of getting it to work at all and the lack of perceived threats meant inadequate security.
  • The Long Life Of A Quick 'Fix', discussing the history of BGP and the consistent failure of attempts to make it less insecure, because those who would need to take action have no incentive to do so.
  • A Disaster Foretold - And Ignored, discussing L0pht and how they warned a Senate panel 17 years ago of the dangers of Internet connectivity but were ignored.
Perhaps a future article in the series will describe how successive US administrations consistently strove to ensure that encryption wasn't used to make systems less insecure, and that the encryption that was used was as weak as possible. They prioritized their (and their opponents') ability to spy over mitigating the risks that Internet users faced, and they got what they wanted, as we see with the compromise of the Office of Personnel Management and the possibly related compromise of health insurers including Anthem. These breaches revealed the kind of information that renders everyone with a security clearance vulnerable to phishing and blackmail. Be careful what you wish for!

More below the fold.

Tuesday, June 23, 2015

Future of Research Libraries

Bryan Alexander reports on a talk by Xiaolin Zhang, the head of the National Science Library at the Chinese Academy of Sciences (CAS), on the future of research libraries.
Director Zhang began by surveying the digital landscape, emphasizing the rise of ebooks, digital journals, and machine reading. The CAS decided to embrace the digital-first approach, and canceled all print subscriptions for Chinese-language journals. Anything they don’t own they obtain through consortial relationships ...

This approach works well for a growing proportion of the CAS constituency, which Xiaolin referred to as “Generation Open” or “Generation Digital”. This group benefits from – indeed, expects – a transition from print to open access. For them, and for our presenter, “only ejournals are real journals. Only smartbooks are real books… Print-based communication is a mistake, based on historical practicality.” It’s not just consumers, but also funders who prefer open access.
Below the fold, some thoughts on Director Zhang's vision.

Friday, June 19, 2015

EE380 talk on eBay storage

Russ McElroy & Farid Yavari gave a talk to Stanford's EE380 course describing how eBay's approach to storage (YouTube) is driven by their Total Cost of Ownership (TCO) model. As shown in this screengrab, by taking into account all the cost elements, they can justify the higher capital cost of flash media in much the same way, but with much more realistic data and across a broader span of applications, as Ian Adams, Ethan Miller and I did in our 2011 paper Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes.

We were inspired by a 2009 paper, FAWN: A Fast Array of Wimpy Nodes, in which David Andersen and his co-authors from C-MU showed that a network of large numbers of small CPUs coupled with modest amounts of flash memory could process key-value queries at the same speed as the networks of beefy servers used by, for example, Google, but using 2 orders of magnitude less power.

As this McElroy slide shows, power cost is important and it varies over a 3x range (a problem for Kaminska's thesis about the importance of 21 Inc's bitcoin mining hardware). He specifically mentions the need to get the computation close to the data, with ARM processors in the storage fabric. In this way the amount of data to be moved can be significantly reduced, and thus the capital cost, since as he reports the cost of the network hardware is 25% of the cost of the rack, and it burns a lot of power.
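
To make the interaction between capital cost and power cost concrete, here is a minimal sketch of a TCO comparison. It is not eBay's model: the function name, the two cost elements it considers, and every figure in it are hypothetical, chosen only to show how a 3x range in the price of power can change which medium comes out cheaper.

```python
def total_cost_of_ownership(capital_cost, watts, power_price_per_kwh, years):
    """Capital cost plus the cost of the electricity drawn over the service life.

    A real TCO model has many more cost elements (space, cooling, network,
    staff); this sketch deliberately includes only two of them.
    """
    hours = years * 365 * 24
    energy_kwh = watts * hours / 1000.0
    return capital_cost + energy_kwh * power_price_per_kwh


# Hypothetical figures: flash costs more to buy but draws less power.
disk = {"capital_cost": 1000.0, "watts": 100.0}
flash = {"capital_cost": 1500.0, "watts": 20.0}

for price in (0.05, 0.15):  # a 3x range in the price of power, in $/kWh
    disk_tco = total_cost_of_ownership(power_price_per_kwh=price, years=5, **disk)
    flash_tco = total_cost_of_ownership(power_price_per_kwh=price, years=5, **flash)
    cheaper = "flash" if flash_tco < disk_tco else "disk"
    print(f"${price:.2f}/kWh: disk ${disk_tco:.2f}, flash ${flash_tco:.2f}, cheaper: {cheaper}")
```

With these made-up numbers disk is cheaper at the low power price and flash is cheaper at the high one, which is the sense in which the capital cost alone doesn't settle the question.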

At present, eBay relies on tiering, moving data to less expensive storage such as consumer hard drives when it hasn't been accessed in some time. As I wrote last year:
Fundamentally, tiering like most storage architectures suffers from the idea that in order to do anything with data you need to move it from the storage medium to some compute engine. Thus an obsession with I/O bandwidth rather than what the application really wants, which is query processing rate. By moving computation to the data on the storage medium, rather than moving data to the computation, architectures like DAWN and Seagate's and WD's Ethernet-connected hard disks show how to avoid the need to tier and thus the need to be right in your predictions about how users will access the data.
That post was in part about Facebook's use of tiering, which works well because Facebook has highly predictable data access patterns. McElroy's talk suggests that eBay's data accesses are somewhat predictable, but much less so than Facebook's. This makes his implication that tiering isn't a good long-term approach plausible.
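
The dependence of tiering on predictable access can be seen in a minimal sketch of an idle-time tiering rule. This is not eBay's or Facebook's policy: the function name, the tier labels and the 90-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def choose_tier(last_access, now, idle_threshold=timedelta(days=90)):
    """Place data on the cheap tier once it has been idle longer than the threshold.

    The rule is an implicit prediction that data idle for this long will stay
    idle. If access patterns are unpredictable, the prediction is often wrong
    and data has to be moved back to the fast tier.
    """
    return "cold" if now - last_access > idle_threshold else "hot"


now = datetime(2015, 6, 19)
print(choose_tier(datetime(2015, 6, 1), now))  # accessed 18 days ago
print(choose_tier(datetime(2015, 1, 1), now))  # accessed 169 days ago
```

The less predictable the accesses, the more often a rule like this guesses wrong, whatever threshold is chosen.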

Tuesday, June 16, 2015

Alphaville on Bitcoin

I'm not a regular reader of the Financial Times, so I really regret I hadn't noticed that Izabella Kaminska and others at the FT's Alphaville blog have been posting excellent work in their BitcoinMania series. For a taste, see Bitcoin's lien problem, in which Kaminska discusses the problems caused by the fact that the blockchain records the transfer of assets but not the conditions attached to the transfer:
For example, let's hypothesise that Tony Soprano was to start a bitcoin loan-sharking operation. The bitcoin network would have no way of differentiating bitcoins being transferred from his account with conditions attached - such as repayment in x amount of days, with x amount of points of interest or else you and your family get yourself some concrete boots - and those being transferred as legitimate and final settlement for the procurement of baked cannoli goods.

Now say you've lost all the bitcoin you owe to Tony Soprano on the gambling website Satoshi Dice. What are the chances that Tony forgets all about it and offers you a clean slate? Not high. Tony, in all likelihood, will pursue his claim with you.
She reports work by George K. Fogg at Perkins Coie on the legal status of Tony's claim:
Indeed, given the high volume of fraud and default in the bitcoin network, chances are most bitcoins have competing claims over them by now. Put another way, there are probably more people with legitimate claims over bitcoins than there are bitcoins. And if they can prove the trail, they can make a legal case for reclamation.

This contrasts considerably with government cash. In the eyes of the UCC code, cash doesn't take its claim history with it upon transfer. To the contrary, anyone who acquires cash starts off with a clean slate as far as previous claims are concerned. ... According to Fogg there is currently only one way to mitigate this sort of outstanding bitcoin claim risk in the eyes of US law. ... investors could transform bitcoins into financial assets in line with Article 8 of the UCC. By doing this bitcoins would be absolved from their cumbersome claim history.

The catch: the only way to do that is to deposit the bitcoin in a formal (a.k.a licensed) custodial or broker-dealer agent account.
In other words, to avoid the lien problem you have to submit to government regulation, which is what Bitcoin was supposed to escape from. Government-regulated money comes with a government-regulated dispute resolution system. Bitcoin's lack of a dispute resolution system is seen in the problems Ross Ulbricht ran into.

Below the fold, I start from some of Kaminska's more recent work and look at another attempt to use the blockchain as a Solution to Everything.

Tuesday, June 9, 2015

Preserving the Ads?

Quinn Norton writes in The Hypocrisy of the Internet Journalist:
It’s been hard to make a living as a journalist in the 21st century, but it’s gotten easier over the last few years, as we’ve settled on the world’s newest and most lucrative business model: invasive surveillance. News site webpages track you on behalf of dozens of companies: ad firms, social media services, data resellers, analytics firms — we use, and are used by, them all.
...
I did not do this. Instead, over the years, I only enabled others to do it, as some small salve to my conscience. In fact, I made a career out of explaining surveillance and security, what the net was doing and how, but on platforms that were violating my readers as far as technically possible.
...
We can become wizards in our own right, a world of wizards, not subject to the old powers that control us now. But it’s going to take a lot of work. We’re all going to have to learn a lot — the journalists, the readers, the next generation. Then we’re going to have to push back on the people who watch us and try to control who we are.
Georgios Kontaxis and Monica Chew won "Best Paper" at the recent Web 2.0 Security and Privacy workshop for Tracking Protection in Firefox for Privacy and Performance (PDF). They demonstrated that Tracking Protection provided:
a 67.5% reduction in the number of HTTP cookies set during a crawl of the Alexa top 200 news sites. [and] a 44% median reduction in page load time and 39% reduction in data usage in the Alexa top 200 news site.
Below the fold, some details and implications for preservation:

Sunday, June 7, 2015

Brief talk at Columbia

I gave a brief talk during the meeting at Columbia on Web Archiving Collaboration: New Tools and Models to introduce the session on Tools/APIs: integration into systems and standardization. The title was "Web Archiving APIs: Why and Which?" An edited text is below the fold.