- getting to the point where they have caught up with ingesting the past, even though some still remains to be processed into its final archival form,
- and having an automated process in place capable of ingesting the current tweets in near-real-time.
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

Notice the roughly 10-to-1 compression ratio. At that ratio, the two copies of the archive together would be in the region of 1.3PB uncompressed. The average compressed tweet takes up about 130×10^12/(2×170×10^9) ≈ 380 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The Library is ingesting about 0.5×10^9 tweets/day at 380 bytes/tweet, or 190GB/day, or about 2.2MB/s (roughly 18Mb/s) of bandwidth, ignoring overhead. These numbers will grow as the flow of tweets increases. The data ends up on tape:
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.

The scale and growth rate of this collection explain the difficulties the Library has in satisfying the 400-odd requests they already have from scholars to access it for research purposes:
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.

This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.
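
For readers who want to check the back-of-the-envelope numbers above, here is a minimal sketch of the arithmetic in Python. The function and constant names are hypothetical, decimal units (1TB = 10^12 bytes) are assumed, and the 0.5 billion tweets/day rate is the estimate used in the commentary rather than a measured value. It uses the reported 133.2TB rather than the rounded 130TB, so the results come out slightly above the rounded figures quoted earlier.

```python
# Figures taken from the Library's report as quoted in this post.
ARCHIVE_COMPRESSED_TB = 2.3      # 2006-2010 archive, compressed
ARCHIVE_UNCOMPRESSED_TB = 20.0   # same archive, uncompressed
TOTAL_COMPRESSED_TB = 133.2      # both tape copies, compressed
TAPE_COPIES = 2
TOTAL_TWEETS = 170e9

# Assumption used in the commentary: about half a billion tweets per day.
TWEETS_PER_DAY = 0.5e9

BYTES_PER_TB = 1e12              # assumption: decimal terabytes
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_ingest():
    """Return rough size and bandwidth estimates for the tweet archive."""
    compression_ratio = ARCHIVE_UNCOMPRESSED_TB / ARCHIVE_COMPRESSED_TB

    # Size of a single compressed copy, spread over all tweets.
    bytes_per_copy = TOTAL_COMPRESSED_TB * BYTES_PER_TB / TAPE_COPIES
    bytes_per_tweet = bytes_per_copy / TOTAL_TWEETS

    bytes_per_day = TWEETS_PER_DAY * bytes_per_tweet
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY

    return {
        "compression_ratio": compression_ratio,
        "bytes_per_tweet": bytes_per_tweet,
        "gb_per_day": bytes_per_day / 1e9,
        "mb_per_second": bytes_per_second / 1e6,
        "mbit_per_second": bytes_per_second * 8 / 1e6,
    }


if __name__ == "__main__":
    for name, value in estimate_ingest().items():
        print(f"{name}: {value:.1f}")
```

The sketch ignores protocol overhead and assumes the average compressed tweet size stays constant, so it is best read as an order-of-magnitude check rather than a capacity plan.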