Friday, January 4, 2013

Go Library of Congress!

Carl Franzen at talkingpointsmemo.com pointed me to the report from the Library of Congress on the state of their ingest of the Twitter stream. Congratulations to the team for two major achievements:
  • getting to the point where they have caught up with ingesting the historical backlog, even though some of it still remains to be processed into its final archival form,
  • and having an automated process in place capable of ingesting the current tweets in near-real-time.
The numbers are impressive:
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Notice the roughly 10-to-1 compression ratio. The two compressed copies together would be in the region of 1.3PB uncompressed, or about 0.67PB each. The average compressed tweet takes up about 133.2×10^12/(2×170×10^9) ≈ 390 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The Library is ingesting about 0.5×10^9 tweets/day at roughly 390 bytes/tweet, or close to 200GB/day, which is about 2.3MB/s of bandwidth (ignoring overhead; see the sketch below). These numbers will grow as the flow of tweets increases. The data ends up on tape:
Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.
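As a sanity check, here is a minimal Python sketch re-doing the back-of-the-envelope arithmetic above. The figures are those quoted from the Library's report and in this post; using the exact 133.2TB figure, the results come out slightly above the rounded numbers in the text:

    # Back-of-the-envelope ingest arithmetic (illustrative only).
    compressed_total_tb = 133.2   # two compressed copies, as of December 2012
    copies = 2
    total_tweets = 170e9          # approximate tweets archived
    compression_ratio = 10        # roughly 10-to-1, per the 2.3TB -> 20TB figures
    tweets_per_day = 0.5e9        # approximate current ingest rate

    bytes_per_tweet = compressed_total_tb * 1e12 / (copies * total_tweets)
    uncompressed_total_pb = compressed_total_tb * compression_ratio / 1e3
    ingest_bytes_per_day = tweets_per_day * bytes_per_tweet
    ingest_mb_per_s = ingest_bytes_per_day / 86_400 / 1e6

    print(f"compressed bytes per tweet: {bytes_per_tweet:.0f}")              # ~392
    print(f"uncompressed size, both copies: {uncompressed_total_pb:.2f}PB")  # ~1.33
    print(f"ingest: {ingest_bytes_per_day / 1e9:.0f}GB/day, {ingest_mb_per_s:.1f}MB/s")  # ~196, ~2.3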
The scale and growth rate of this collection explain the difficulties the Library has in satisfying the 400-odd requests it has already received from scholars for research access:
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
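To get a feel for why that is, here is a rough, purely illustrative estimate of how long a brute-force scan of one compressed copy would take at various levels of parallelism. The 100MB/s per-server read rate is my assumption, not a figure from the report, and the estimate ignores the (dominant) cost of staging data off tape:

    # Illustrative brute-force scan times over one compressed copy of the archive.
    archive_tb = 133.2 / 2        # one compressed copy
    throughput_mb_per_s = 100     # assumed sustained read rate per server

    def scan_days(servers: int) -> float:
        """Days to stream the whole copy across `servers` machines in parallel."""
        seconds = archive_tb * 1e6 / (throughput_mb_per_s * servers)
        return seconds / 86_400

    for n in (1, 10, 100, 1000):
        print(f"{n:5d} servers: {scan_days(n):7.2f} days per full scan")

Even on these optimistic assumptions, getting per-scan times down from days to hours or minutes takes hundreds to thousands of servers, which is exactly the infrastructure the Library describes as cost-prohibitive.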
This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.

1 comment:

David. said...

The Library of Congress' bulk collection of tweets will cease on December 31, 2017:

"After this time, the Library will continue to acquire tweets but will do so on a very selective basis under the overall guidance provided in the Library’s Collections Policy Statements and associated documents (loc.gov/acq/devpol/). Generally, the tweets collected and archived will be thematic and event-based, including events such as elections, or themes of ongoing national interest, e.g. public policy.

The Library will also engage with Twitter to resolve issues associated with managing transactions that generate deletions of tweets, and user access issues. The Twitter collection will remain embargoed until access issues can be resolved in a cost-effective and sustainable manner."

Hat tip to Matt Novak at Gizmodo.