Monday, September 23, 2013

Worth reading

Below the fold, quick comments on two good reads.

First is Resurrecting My Revolution: Using Social Link Neighborhood in Bringing Context to the Disappearing Web by Hany SalahEldeen and Michael Nelson, which was noticed by Salon. It is in two parts. They first repeated their earlier study of the rates at which pages linked from social media vanished from the live Web and appeared in archives. The new results support the earlier work, but reveal some instances in which resources that were "archived" in the earlier study no longer are. They hypothesize that this is because the earlier study treated search engine caches as "archives". I criticized this assumption in a blog post, and the new study doesn't include them.

The second part of their paper is novel. They observe the importance of social media, and Twitter in particular, to understanding current events. But the 140-character limit means that the tweets themselves contain little of the information; they point to and comment on resources elsewhere on the Web. As their earlier work showed, even if the tweets are archived, the resources they point to are steadily vanishing. So they ask whether it is possible to reconstruct a vanished resource, well enough to understand it, from its context of tweets and other Web resources.

Their technique is complex; you need to read the paper for the details. But the results are fairly impressive. They constructed a set of live Web resources, then applied their technique as if these resources had vanished. For 41% of the 731 "missing" resources, they were able to find a replacement on the live Web that they measured as at least 70% similar to the "missing" resource.
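
To make the shape of that evaluation concrete, here is a minimal sketch in Python. It is not the authors' method: the paper's actual similarity measure is not described here, so the sketch assumes plain cosine similarity over word counts, and the function names (term_vector, cosine_similarity, fraction_recovered) are hypothetical. It only shows how a "fraction of resources with a replacement at least 70% similar" figure could be computed from pairs of original and candidate texts.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two texts' word-count vectors (an assumed measure)."""
    va, vb = term_vector(a), term_vector(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def fraction_recovered(pairs, threshold=0.70):
    """Fraction of (original, candidate) pairs at or above the threshold."""
    if not pairs:
        return 0.0
    hits = sum(
        1 for original, candidate in pairs
        if cosine_similarity(original, candidate) >= threshold
    )
    return hits / len(pairs)
```

For scale, 41% of 731 is roughly 300 resources for which a sufficiently similar replacement was found.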

Second, Trevor Pott at The Register has a good summary of the pros and cons of cloud storage. He is talking about cloud storage for general use, not specifically for long-term preservation. He points out the economies of scale:
"Cloud providers leverage economies of scale. That doesn't just apply to getting cheap hardware or having a little bit of muscle when they sit down with Microsoft across that long negotiating table. It means things like testing, automation, quality assurance and the raw manpower that can be brought to bear."
He just doesn't ask where the benefits of the economies of scale end up (hint: not with the customer). On the other hand he is right on the money here:
"The cloud is great for bursty workloads, temporary workloads, or dev environments. It's great for getting something working fast or for a fixed timeframe. Want to run for president? Well, you could invest in a sea can full of servers to do the work you need, or you can spin it all up on a public cloud for a couple of years and then tear it down after the election's over.
"Pay as you go means no upfront investment. If your business succeeds then it will pay the bill at the end of the month. If it fails, it fails. You don't have to dig in your jeans to buy a great big pile of infrastructure just to play the game. All you need is the first month's 'rent'."
This just doesn't apply to digital preservation, which is a long-term commitment rather than a bursty or temporary workload.
