DSHR's Blog

Tuesday, January 15, 2013

Moving vs. Copying

At the suggestion of my long-time friend Frankie, I've been reading Trillions, a book by Peter Lucas, Joe Ballay and Mickey McManus. They are principals of MAYA Design, a design firm that emerged from the Design and CS schools at Carnegie-Mellon in 1989. Among its founders was Jim Morris, who ran the Andrew Project at C-MU on which I worked from 1983-85. The ideas in the book draw not just from the Andrew Project's vision of a networked campus with a single, uniform file name-space, as partially implemented in the Andrew File System, but also from Mark Weiser's vision of ubiquitous computing at Xerox PARC. Mark's 1991 Scientific American article "The Computer of the 21st Century" introduced the concept to the general public, and although the authors cite it, they seem strangely unaware of work going on at PARC and elsewhere for at least the last 6 years to implement the infrastructure that would make their ideas achievable. Follow me below the fold for the details.

How Much Of The Web Is Archived?

MIT's Technology Review has a nice article about Scott Ainsworth et al.'s important paper How Much Of The Web Is Archived? (readable summary here). The paper reports an important initial step in measuring the effectiveness of Web archiving, and Scott and his co-authors deserve much credit for it. Below the fold I summarize the paper and raise some caveats as to the interpretation of the results. Tip of the hat to the authors for comments on a draft of this post.

Go Library of Congress!

Carl Franzen at talkingpointsmemo.com pointed me to the report from the Library of Congress on the state of their ingest of the Twitter-stream. Congratulations to the team for two major achievements:

getting to the point where they have caught up with ingesting the past, even though some still remains to be processed into its final archival form,
and having an automated process in place capable of ingesting the current tweets in near-real-time.

The numbers are impressive:

On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.

Notice the roughly 10-to-1 compression ratio. The 133.2 terabytes covers two compressed copies, so each copy is about 67TB compressed and would be in the region of 0.65PB uncompressed, or about 1.3PB for the two copies together. The average compressed tweet takes up about 130*10¹²/(2*170*10⁹) ≈ 380 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The library is ingesting about 0.5*10⁹ tweets/day at 380 bytes/tweet, or 190GB/day, or about 2.2MB/s (roughly 18Mb/s) of bandwidth, ignoring overhead. These numbers will grow as the flow of tweets increases. The data ends up on tape:

Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.
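
The back-of-the-envelope numbers above are easy to check. Here is a minimal sketch in Python; the function and variable names are my own, the inputs are the figures quoted from the Library's report, and the 0.5 billion tweets/day ingest rate is the approximate figure used above rather than a measured value.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tweet_archive_estimates(
    batch_compressed_tb=2.3,       # 2006-2010 archive, compressed
    batch_uncompressed_tb=20.0,    # 2006-2010 archive, uncompressed
    total_compressed_bytes=133.2e12,  # two compressed copies
    copies=2,
    total_tweets=170e9,
    tweets_per_day=0.5e9,          # assumed approximate ingest rate
):
    """Return rough size and bandwidth estimates for the tweet archive."""
    compression_ratio = batch_uncompressed_tb / batch_compressed_tb
    # Divide by the number of copies to get the size of a single copy.
    bytes_per_tweet = total_compressed_bytes / (copies * total_tweets)
    bytes_per_day = tweets_per_day * bytes_per_tweet
    bytes_per_sec = bytes_per_day / SECONDS_PER_DAY
    return {
        "compression_ratio": compression_ratio,
        "bytes_per_tweet": bytes_per_tweet,
        "gb_per_day": bytes_per_day / 1e9,
        "mbytes_per_sec": bytes_per_sec / 1e6,
        "mbits_per_sec": bytes_per_sec * 8 / 1e6,
    }


if __name__ == "__main__":
    for name, value in tweet_archive_estimates().items():
        print(f"{name}: {value:.1f}")
```

Using the exact 133.2TB rather than the rounded 130TB nudges the per-tweet figure up slightly, but the conclusion is the same: a few hundred bytes per compressed tweet, and a couple of megabytes per second of sustained ingest.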

The scale and growth rate of this collection explain the difficulties the library has in satisfying the 400-odd requests they already have from scholars to access it for research purposes:

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.

This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.

Tuesday, December 11, 2012

Talk at Fall 2012 CNI

I gave a talk at CNI's Fall 2012 Membership Meeting entitled The Truth Is Out There: Preservation and the Cloud. It was an updated and shortened version of a seminar I had given in mid-November at UC Berkeley's School of Information. Below the fold is an edited text with links to the resources.

Sharing makes Glacier economics even better

A more detailed analysis than the one I posted here of the economics of Glacier sharing the same infrastructure as S3 makes the picture look even better from Amazon's point of view. The point I missed is that the infrastructure is shared. Follow me below the fold for the details.

Nostalgia

Google has a nice post with a short video commemorating today's 50th birthday of the Ferranti Atlas, the UK's first "supercomputer". Although it wasn't the first computer I programmed, Cambridge University's Atlas 2 prototype, called Titan, was the machine I really learned programming on, starting in 1968. It was in production from 1966 to 1973 with a time-sharing operating system using Teletype KSR33 terminals, a device-independent file system and many other ground-breaking features. I got access to it late at night as an undergraduate member of the Archimedeans, the University Mathematical Society. I wrote programs in machine code (as I recall there was no mnemonic assembler, you had to remember the numeric op-codes), in Atlas Autocode, and in BCPL. Best of all, it was attached to a PDP-7 with a DEC 340 display, which a friend and I programmed to play games.

Thursday, December 6, 2012

Updating "More on Glacier Pricing"

In September I posted More on Glacier Pricing including a comparison with our baseline local storage model. Last week I posted Updating "Cloud vs. Local Storage Costs" which among other things updated and corrected the baseline local storage model. Thus I needed to update the comparison with Glacier too. Below the fold is this updated comparison, together with a back-of-the-envelope calculation to support the claim I've been making that although Glacier may look like tape, it could just be using S3's disk storage infrastructure.