Tuesday, November 19, 2019

Seeds Or Code?

I'd like to congratulate Microsoft on a truly excellent PR stunt, drawing attention to two important topics about which I've been writing for a long time: the cultural significance of open source software, and the need for digital preservation. Ashlee Vance provides the channel to publicize the stunt in Open Source Code Will Survive the Apocalypse in an Arctic Cave. In summary, near Longyearbyen on Spitsbergen is:
the Svalbard Global Seed Vault, where seeds for a wide range of plants, including the crops most valuable to humans, are preserved in case of some famine-inducing pandemic or nuclear apocalypse.
Nearby, in a different worked-out coal mine, is the Arctic World Archive:
The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.
GitHub, the recently acquired Microsoft subsidiary, will deposit there:
The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.
Follow me below the fold for an explanation of why I call this admirable effort a PR stunt, albeit a well-justified one.

GitHub's CEO, Nat Friedman, explains why they're doing this:
Open source software, in his view, is one of the great achievements of our species, up there with the masterpieces of literature and fine art. It has become the foundation of the modern world—not just the internet and smartphones, but satellites, medical devices, scientific tools, robots.
I have always believed this, as I wrote in 2013:
Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.
Vance provides just one clue as to why it is a stunt:
If the world is ravaged to the point where Svalbard is the last repository of usable wheat and corn seeds, the source code for YouTube will probably rank pretty low on humankind’s hierarchy of needs.
Let's try to construct a plausible scenario in which the films in the cave in Svalbard would be useful in the foreseeable future.

Our story starts in a Mad Max world. A huge Coronal Mass Ejection (CME) has destroyed the world's electrical grid and everything it powers. Microsoft's servers and backup tapes are history. So are the Software Heritage collection and all the contributors' copies of their work. The Internet is down. The zombie apocalypse, climate change, and a genetically engineered bio-weapon have wiped out most humans, and society has collapsed. But a plucky group of survivors has a solar-powered Raspberry Pi, a Pi camera, a micro-SD card with Raspbian, and a 24 TB USB flash drive. They had all been proactively sealed in a coffee can, so they survived the CME. They know of films in a mine in Svalbard with the contents of GitHub. They set out to find the films and thus save the world.

Svalbard by Oona Räisänen
First, they have to get to Svalbard and then get back. These days it is much easier than it was in WW II, when my father served on Arctic Convoy escorts, or in the summer of 1969, when it took an icebreaker a few days' sailing to get a dozen Cambridge University geologists and geology students (me included) there at the start of the summer. And a few days at the end of summer to get us back across the Barents Sea through a storm severe enough to sink a similar ship and sweep one of our helicopters off the flight deck. The hull of an icebreaker is semi-circular in cross-section so that it rides up above ice encroaching from the sides. So the rolling of an icebreaker in rough seas has to be experienced to be believed.

Nowadays there are scheduled flights to Longyearbyen's airport (LYR). Of course in any scenario where the films would be needed you couldn't fly there, and you probably couldn't fuel a ship sturdy enough to survive the much worse weather after a few decades of global warming (our icebreaker had to be accompanied by a tanker ship to last the season). You can't walk there. Even today you'd be very lucky to survive the weather during a round-trip from Tromsø in a sailing yacht without access to forecasts or GPS. And you'd need to get to Tromsø, which is a long walk from anywhere our survivors would be likely to start from on the Eurasian land mass.

The voyage
Let's assume that our heroes make it to Tromsø and find a nuclear submarine which just happened to be moored to the wharf with its hatches closed and antennae retracted when the CME hit, so it survived unscathed. The crew, however, after mooring it and closing the hatches, went to sample the nightlife of Tromsø, so they succumbed to the bio-weapon or the zombies. Once they get the hatches open, our heroes find a printed copy of Getting Started With Your New Submarine on the control desk. And fortunately the crew hadn't bothered to change the sub's default password (admin/admin).

Being very lucky with the weather, our heroes manage to navigate by the stars and dead reckoning to Longyearbyen. Global warming means they don't have to worry, as we did, about polar bears, and there will be plenty of dead seals whose blubber can light their way into the cave.

By the flickering light of seal-blubber torches our heroes creep cautiously into the cave and find the repository where, decades ago:
Friedman comes to what looks like a metal tool shed. ... Friedman unlocks the container door with a simple door key and, inside, deposits much of the world’s open source software code.
But where is the key? Fortunately, one of our heroes is an expert lock-picker, so it is mere moments before they are face-to-face with:
200 platters, each carrying 120 gigabytes of open source software code,
Along with:
Vatican archives, Brazilian land registry records, loads of Italian movies, and the recipe for a certain burger chain’s special sauce
They drag the 200 platters out into the 24-hour sunshine, plug the solar panel into the Raspberry Pi, point its camera through a magnifying glass at the first frame, and let the QR app they happen to have on the Pi's micro-SD card do its thing. A couple of seconds later they have the first 2,900 bytes on the USB drive. It takes another couple of seconds to move to the next frame by hand. At 2,900 bytes per frame the 24 TB archive is more than 8 billion frames, so they sit there for about 383,000 days, well over a thousand years, scanning a frame every 4 seconds to decode the entire archive. Except there's only sunshine enough for the Pi half the year, so it takes rather more than two thousand years.
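The arithmetic behind that estimate can be sketched in a few lines of Python. The inputs are the figures quoted above (200 platters, 120 gigabytes each, about 2,900 bytes per frame, about 4 seconds per frame). Treating a gigabyte as 10^9 bytes is my assumption, and the constant and function names are invented for illustration.

```python
# Back-of-the-envelope scan time, using only the figures quoted in the post.
PLATTERS = 200
BYTES_PER_PLATTER = 120 * 10**9   # assumption: decimal gigabytes
BYTES_PER_FRAME = 2_900           # payload recovered from one QR frame
SECONDS_PER_FRAME = 4             # ~2 s to decode plus ~2 s to advance by hand
SECONDS_PER_DAY = 86_400


def scan_estimate(platters=PLATTERS,
                  bytes_per_platter=BYTES_PER_PLATTER,
                  bytes_per_frame=BYTES_PER_FRAME,
                  seconds_per_frame=SECONDS_PER_FRAME):
    """Return (total bytes, frames to scan, days of non-stop scanning)."""
    total_bytes = platters * bytes_per_platter
    frames = -(-total_bytes // bytes_per_frame)   # ceiling division
    days = frames * seconds_per_frame / SECONDS_PER_DAY
    return total_bytes, frames, days


total_bytes, frames, days = scan_estimate()
print(f"archive size    : {total_bytes / 10**12:.0f} TB")
print(f"frames to scan  : {frames:,}")
print(f"non-stop scan   : {days:,.0f} days (~{days / 365.25:,.0f} years)")
# Only about half the year has enough sunshine to power the Pi.
print(f"half-year sun   : ~{2 * days / 365.25:,.0f} years")
```

Changing any one input (a faster frame advance, a denser frame) rescales the result linearly, which makes it easy to see how far even generous assumptions are from a practical recovery time.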

Then they need to start the Pi building all that code ...

Of course, this is ridiculous. No-one will decode this archive in the foreseeable future. It is a PR stunt, or perhaps more accurately a koan, like the golden records of Voyager 1 and 2, or the Long Now's clock, to get people to think about the importance of the long term. Such koans have considerable value, as I wrote recently:
IIRC it was at the 1996 Hacker's conference where Hillis and Brand talked about the idea of the clock. The presentation set me thinking about the long-term future of digital information, and about how systems to provide it needed to be ductile rather than, like Byzantine Fault Tolerance, brittle. The LOCKSS Program was the result a couple of years later.
Vance writes:
In the range of possible futures in which humanity has working modern computers, but no software to run on them, the archive and its Tech Tree could be extremely valuable. However, the value is more likely to be historical, perhaps ensuring that today’s technology is not lost by a tomorrow that carelessly considers it irrelevant—until an unexpected use for our software is discovered.
Actually, that range of futures would face a nearly insuperable bootstrap problem, so the range would effectively be zero. That's why our heroes' Raspberry Pi in a coffee can needed a micro-SD card with Raspbian, to avoid the need to bootstrap the software stack. How you would actually bootstrap the software stack if you needed to is left as an exercise for the reader. Some clues are here:
One sub-project is Stage0:
Stage0 starts with just a 280byte Hex monitor and builds up the infrastructure required to start some serious software development. With zero external dependencies, with the most painful work already done and real languages such as assembly, forth and garbage collected lisp already implemented.
The current 0.2.0 release of Stage0:
marks the first C compiler hand written in Assembly with structs, unions, inline assembly and the ability to self-host it's C version, which is also self-hosting
There is clearly a long way still to go to a bootstrapped full toolchain.
PDP-1 Console
By fjarlq/Matt CC-BY 2.0
But think about the lack, on modern hardware, of front-panel switches like the PDP-1's for entering those first bytes by hand.
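To make the first rung of that ladder concrete, here is a minimal sketch of the job a hex loader does: turning hand-typed hexadecimal text into raw bytes. This is not Stage0's code, which has to exist as machine code before any language is available. The function name and the choice of '#' and ';' as comment markers are assumptions made for illustration.

```python
def load_hex(text: str) -> bytes:
    """Turn commented hex text into raw bytes, two hex digits per byte."""
    digits = []
    for line in text.splitlines():
        # Assumption: '#' or ';' starts a comment that runs to end of line.
        for marker in "#;":
            line = line.split(marker, 1)[0]
        # Whitespace between digits is only there for the human typist.
        digits.extend(ch for ch in line if not ch.isspace())
    joined = "".join(digits)
    if len(joined) % 2:
        raise ValueError("odd number of hex digits")
    # bytes.fromhex raises ValueError on any non-hex character.
    return bytes.fromhex(joined)


program = """
48 65 6C 6C 6F   # the bytes for "Hello"
0A               ; a trailing newline
"""
image = load_hex(program)
print(image)
```

The hard part of a real bootstrap is not this logic but getting an equivalent of it into a bare machine that has no loader yet.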

Vance is right that it might be a useful historical artifact. But that isn't really a good justification for the effort compared to its effect in pushing back on today's short-termism.

If it were some other organization I would be pointing out that this is an extraordinarily expensive way to preserve 24TB of data, but Microsoft can afford it. It is, however, important to observe that the films are only part of this project. Microsoft understands that:
Archiving software across multiple organizations and forms of storage will help ensure its long-term preservation: online archivists call this "LOCKSS," for Lots Of Copies Keeps Stuff Safe.
Even in the near future, storing data with multiple partners provides options to people whose access might otherwise be restricted. If GitHub were to become unavailable in any location, for example due to an internet routing issue, those affected could access public code for their projects using the Internet Archive and Software Heritage Foundation.
They are using a range of technologies, making feeds available over the Internet, and partnering with the Internet Archive, the Software Heritage Foundation and the Bodleian Library. These are mostly things which will get used in the foreseeable future, and should be applauded for that reason.

PS: The correct English is LOCKSS standing for Lots Of Copies Keep Stuff Safe. Not "Keeps": I should know. And it is a trademark of Stanford University.


Dragan Espenschied said...

If I had one Euro for each digital preservation project that shoots data into oblivion—to the moon, into abandoned mine shafts, under the eternal ice—I could already buy a hard drive.

David. said...

Ooops - I forgot to provide our plucky band of survivors with some way to talk to the Raspberry Pi, so they would be stuck after they dragged the films out into the sunlight. Somehow a display, a keyboard and some means to power them must have survived the CME. Or maybe it was a very large coffee can with a Raspberry Pi laptop inside.

Jon Evans said...

Archive Program director here - it's really not a PR stunt, we genuinely believe it will be of significant historical value and quite a good chance it will be of practical value.

Much of that is "if we forget technology which we realize somewhere down the road we actually might want to use again." History provides plenty of examples of this, and it's particularly important with a technology which mostly lives on ephemeral media that only lasts a few decades.

Even if you do expand your speculation to post-disaster scenarios, though, while it's true the archive wouldn't be an instant reset button, it could help greatly accelerate the recovery of software technology. It's worth noting that it will come with a slew of (human-readable, not encoded) technical works regarding subjects ranging from modern software engineering to microprocessor design to photolithography to power systems, which we call the Tech Tree, along with a guide and index to all the stored repos. Wherever its inheritors / discoverers may be in terms of technological advancement, and especially if they have modern-ish hardware (which can last much, much longer than most storage media), recovering the archive's contents will be a lot faster than rediscovering them from scratch.

(Also worth noting we'll be storing "greatest hits" copies of the ~15,000 most-starred / most-relied-on repos, along with a sampling of several thousand repos with few/no stars, in a selection of places like Oxford's Bodleian Library; our hypothetical future tech seekers won't have to go all the way to Svalbard for those.)

I don't want to stress the doomsday scenarios too much, though. I think the most likely outcome by far is that progress will continue, the archive may be useful to recover a couple of otherwise forgotten technologies that suddenly become important / interesting, and it will ultimately be chiefly of interest to historians. The historical value is a key reason why it casts such a broad net: I too have a couple of fairly unsophisticated pet projects in there that the future won't be interested in individually - but collectively is another matter. One of the most interesting things our advisory committee told us is that history is replete with lists composed by wealthy people of the books they thought most important, carefully preserved for posterity, whereas what modern historians _really_ want is ordinary people's shopping lists, of which almost none survived. That's one reason there are millions of repos in the Arctic now, instead of eg just the most-starred 100K: they're the modern technological equivalent of Renaissance shopping lists, for the historians who may take a particular interest in this (possibly) especially wacky and volatile era.

I know it's an inherently cinematic and dramatic project and so it's tempting to call it a PR stunt - but I assure you it's not, and, speaking personally, I would never have gotten involved with it if I thought it was.

David. said...

Thank you for commenting, Jon. I still find it extremely difficult to come up with any scenario in the foreseeable future where access to the copies in Svalbard would be preferred to access to one of the multiple easier-to-access copies at Software Heritage, the Internet Archive, or even the Bodleian. And so does The Register's headline writer. Thomas Claburn's Imagine surviving WW3, rebuilding computers, opening up GitHub's underground vault just to relive JavaScript reports that:

"Microsoft's GitHub on Thursday said that earlier this month it successfully deposited a snapshot of recently active GitHub public code repositories to an underground vault on the Norwegian archipelago of Svalbard.

GitHub captured every repo with at least one star and any commits dating back a year from February 2, 2020, and every repo with at least 250 stars, in an archival snapshot. The copied code consists of the HEAD of the default branch of each repo, apart from binaries exceeding 100KB, packaged as a single TAR file."

Which is great as far as it goes. After quoting and linking to this post (thanks!) Claburn points out that:

"GitHub in fact is thinking more broadly about code preservation than just burying boxes in a Norwegian mine. The Arctic Code Vault is just one aspect of the GitHub Archive Program, which also encompasses "hot," "warm," and "cold," backup sources, where temperature refers to the frequency of updates.

Thus, GH Torrent and GH Archive provide "hot" storage that gets updated with recent events like pull requests. The Internet Archive and the Software Heritage Foundation provide "warm" GitHub archives that get updated occasionally.

Then there's "cold" storage like the Arctic Code Vault and Oxford University’s Bodleian Library, which will house a copy of the Svalbard data. Bugs in cold storage will be preserved in perpetuity or until cataclysm, whichever comes first."

So, for access to the Svalbard copies to be necessary, a scenario needs to explain how come GitHub, GH Torrent, GH Archive, the Internet Archive, the Software Heritage Foundation and the Bodleian's collection have all become inaccessible, while access to Svalbard and the technology needed to decode the QR codes remain available. I'd be happy to acknowledge a scenario that did so.

Like the Long Now's clock, the Svalbard archive is still valuable. But just as no-one will use the Long Now's clock to tell time, no-one will need to access the data in Svalbard except perhaps for integrity checks.

Curt J. Sampson said...

Well, one scenario is total collapse of civilisation and destruction of the Internet Archive, Bodleian Library, and so on. Nobody will be going in with a Raspberry Pi a few years later to snarf up all those data in the Arctic Code Vault, but at some point, after computers have been re-invented, the data should still be there ready for historians to have a go at it.