Software Heritage is an active project that has already assembled the largest existing collection of software source code. At the time of writing the Software Heritage Archive contains more than four billion unique source code files and one billion individual commits, gathered from more than 80 million publicly available source code repositories (including a full and up-to-date mirror of GitHub) and packages (including a full and up-to-date mirror of Debian). Three copies are currently maintained, including one on a public cloud.
As a graph, the Merkle DAG underpinning the archive consists of 10 billion nodes and 100 billion edges; in terms of resources, the compressed and fully de-duplicated archive requires some 200TB of storage space. These figures grow constantly, as the archive is kept up to date by periodically crawling major code hosting sites and software distributions, adding new software artifacts, but never removing anything. The contents of the archive can already be browsed online, or navigated via a REST API.
I have always believed, as I wrote in 2013:
"Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society."
I'm very disappointed that national libraries haven't accepted this argument, let alone the argument that preservation and access to their other digital collections largely depend on preserving and providing access to open source software. Since they have failed in this task, it is up to the Software Heritage Foundation to step into the breach.
You can find out more at their Web site, and support this important work by donating.