Tuesday, November 19, 2019

Seeds Or Code?

I'd like to congratulate Microsoft on a truly excellent PR stunt, one that draws attention to two important topics about which I've been writing for a long time: the cultural significance of open source software and the need for digital preservation. Ashlee Vance provides the channel to publicize the stunt in Open Source Code Will Survive the Apocalypse in an Arctic Cave. In summary, near Longyearbyen on Spitsbergen is:
the Svalbard Global Seed Vault, where seeds for a wide range of plants, including the crops most valuable to humans, are preserved in case of some famine-inducing pandemic or nuclear apocalypse.
Nearby, in a different worked-out coal mine, is the Arctic World Archive:
The AWA is a joint initiative between Norwegian state-owned mining company Store Norske Spitsbergen Kulkompani (SNSK) and very-long-term digital preservation provider Piql AS. AWA is devoted to archival storage in perpetuity. The film reels will be stored in a steel-walled container inside a sealed chamber within a decommissioned coal mine on the remote archipelago of Svalbard. The AWA already preserves historical and cultural data from Italy, Brazil, Norway, the Vatican, and many others.
GitHub, the recently acquired Microsoft subsidiary, will deposit there:
The 02/02/2020 snapshot archived in the GitHub Arctic Code Vault will sweep up every active public GitHub repository, in addition to significant dormant repos as determined by stars, dependencies, and an advisory panel. The snapshot will consist of the HEAD of the default branch of each repository, minus any binaries larger than 100KB in size. Each repository will be packaged as a single TAR file. For greater data density and integrity, most of the data will be stored QR-encoded. A human-readable index and guide will itemize the location of each repository and explain how to recover the data.
Follow me below the fold for an explanation of why I call this admirable effort a PR stunt, albeit a well-justified one.

GitHub's CEO, Nat Friedman, explains why they're doing this:
Open source software, in his view, is one of the great achievements of our species, up there with the masterpieces of literature and fine art. It has become the foundation of the modern world—not just the internet and smartphones, but satellites, medical devices, scientific tools, robots.
I have always believed this, as I wrote in 2013:
Software, and in particular open source software is just as much a cultural production as books, music, movies, plays, TV, newspapers, maps and everything else that research libraries, and in particular the Library of Congress, collect and preserve so that future scholars can understand our society.
Vance provides just one clue as to why it is a stunt:
If the world is ravaged to the point where Svalbard is the last repository of usable wheat and corn seeds, the source code for YouTube will probably rank pretty low on humankind’s hierarchy of needs.
Let's try to construct a plausible scenario in which the films in the cave in Svalbard would be useful in the foreseeable future.

Our story starts in a Mad Max world. A huge Coronal Mass Ejection (CME) has destroyed the world's electrical grid and everything it powers. Microsoft's servers and backup tapes are history. So are the Software Heritage collection and all the contributors' copies of their work. The Internet is down. The zombie apocalypse, climate change, and a genetically engineered bio-weapon have wiped out most humans, and society has collapsed. But a plucky group of survivors has a solar-powered Raspberry Pi, a Pi camera, a micro-SD card with Raspbian, and a 24 TB USB flash drive. The devices had all been proactively sealed in a coffee can, so they survived the CME. The survivors know of films in a mine in Svalbard with the contents of GitHub. They set out to find the films and thus save the world.

[Image: Svalbard, by Oona Räisänen, CC-BY-SA 4.0]
First, they have to get to Svalbard and then get back. These days it is much easier than it was in WW II, when my father served on Arctic Convoy escorts, or in the summer of 1969, when it took an icebreaker a few days' sailing to get a dozen Cambridge University geologists and geology students (me included) there at the start of the summer. It took a few more days at the end of summer to get us back across the Barents Sea, through a storm severe enough to sink a similar ship and sweep one of our helicopters off the flight deck. The hull of an icebreaker is semi-circular in cross-section so that it rides up above ice encroaching from the sides. As a result, the rolling of an icebreaker in rough seas has to be experienced to be believed.

Nowadays there are scheduled flights to LYR. Of course in any scenario where the films would be needed you couldn't fly there, and you probably couldn't fuel a ship sturdy enough to survive the much worse weather after a few decades of global warming (our icebreaker had to be accompanied by a tanker ship to last the season). You can't walk there. Even today you'd be very lucky to survive the weather during a round-trip from Tromsø in a sailing yacht without access to forecasts or GPS. And you'd need to get to Tromsø, which is a long walk from anywhere our survivors would be likely to start from on the Eurasian land mass.

The voyage
Let's assume that our heroes make it to Tromsø and find a nuclear submarine that just happened to be moored to the wharf with its hatches closed and antennae retracted when the CME hit, so it survived unscathed. The crew, however, after mooring it and closing the hatches, went to sample the nightlife of Tromsø, so they succumbed to the bio-weapon or the zombies. Once they get the hatches open, our heroes find a printed copy of Getting Started With Your New Submarine on the control desk. And fortunately the crew hadn't bothered to change the sub's default password (admin/admin).

Being very lucky with the weather, our heroes manage to navigate by the stars and dead reckoning to Longyearbyen. Global warming means they don't have to worry, as we did, about polar bears, and there will be plenty of dead seals whose blubber can light their way into the cave.

By the flickering light of seal-blubber torches our heroes creep cautiously into the cave and find the repository where, decades ago:
Friedman comes to what looks like a metal tool shed. ... Friedman unlocks the container door with a simple door key and, inside, deposits much of the world’s open source software code.
But where is the key? Fortunately, one of our heroes is an expert lock-picker, so it is mere moments before they are face-to-face with:
200 platters, each carrying 120 gigabytes of open source software code,
Along with:
Vatican archives, Brazilian land registry records, loads of Italian movies, and the recipe for a certain burger chain’s special sauce
They drag the 200 platters out into the 24hr sunshine, plug the solar panel into the Raspberry Pi, point its camera through a magnifying glass at the first frame, and let the QR app they happen to have on the Pi's micro-SD card do its thing. A couple of seconds later they have the first 2,900 bytes on the USB drive. It takes another couple of seconds to move to the next frame by hand. At 200 platters of 120 gigabytes each, the archive is 24 TB, or about 8.3 billion frames. So they sit there for about 383,000 days, roughly 1,050 years, scanning a frame every 4 seconds to decode the entire archive. Except there's only sunshine enough for the Pi half the year, so it takes rather more than two thousand years.
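
The scanning-time arithmetic can be sketched in a few lines of Python, taking the figures quoted above at face value: 200 platters, 120 gigabytes each, about 2,900 bytes per frame, and one frame every 4 seconds. The function name `scan_days` is hypothetical, and decimal units (1 GB = 10**9 bytes) are an assumption, so treat this as a back-of-the-envelope check rather than a description of the real film format.

```python
import math


def scan_days(total_bytes: int, bytes_per_frame: int, seconds_per_frame: float) -> float:
    """Days of continuous scanning needed to read every frame once."""
    frames = math.ceil(total_bytes / bytes_per_frame)
    return frames * seconds_per_frame / 86_400  # 86,400 seconds per day


# Figures quoted in the post; decimal units are an assumption.
PLATTERS = 200
BYTES_PER_PLATTER = 120 * 10**9
BYTES_PER_FRAME = 2_900
SECONDS_PER_FRAME = 4  # about 2 s to decode plus about 2 s to advance by hand

archive_bytes = PLATTERS * BYTES_PER_PLATTER
days = scan_days(archive_bytes, BYTES_PER_FRAME, SECONDS_PER_FRAME)

print(f"archive size: {archive_bytes / 10**12:.0f} TB")
print(f"continuous scanning: {days:,.0f} days ({days / 365.25:,.0f} years)")
print(f"with sunshine only half the year: {2 * days / 365.25:,.0f} years")
```

The result scales linearly with the archive size, so reading the total as gigabytes rather than terabytes shifts the answer by a factor of a thousand; calling `scan_days(24 * 10**9, 2_900, 4)` shows the smaller case.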

Then they need to start the Pi building all that code ...

Of course, this is ridiculous. No one will decode this archive in the foreseeable future. It is a PR stunt, or perhaps more accurately a koan, like the golden records of Voyager 1 and 2, or the Long Now's clock, to get people to think about the importance of the long term. Such koans have considerable value, as I wrote recently:
IIRC it was at the 1996 Hacker's conference where Hillis and Brand talked about the idea of the clock. The presentation set me thinking about the long-term future of digital information, and about how systems to provide it needed to be ductile rather than, like Byzantine Fault Tolerance, brittle. The LOCKSS Program was the result a couple of years later.
Vance writes:
In the range of possible futures in which humanity has working modern computers, but no software to run on them, the archive and its Tech Tree could be extremely valuable. However, the value is more likely to be historical, perhaps ensuring that today’s technology is not lost by a tomorrow that carelessly considers it irrelevant—until an unexpected use for our software is discovered.
Actually, that range of futures would face a nearly insuperable bootstrap problem, so the range would effectively be zero. That's why our heroes' Raspberry Pi in a coffee can needed a micro-SD card with Raspbian, to avoid the need to bootstrap the software stack. How you would actually bootstrap the software stack if you needed to is left as an exercise to the reader. Some clues are here:
One sub-project is Stage0:
Stage0 starts with just a 280byte Hex monitor and builds up the infrastructure required to start some serious software development. With zero external dependencies, with the most painful work already done and real languages such as assembly, forth and garbage collected lisp already implemented.
The current 0.2.0 release of Stage0:
marks the first C compiler hand written in Assembly with structs, unions, inline assembly and the ability to self-host it's C version, which is also self-hosting
There is clearly a long way still to go to a bootstrapped full toolchain.
[Image: PDP-1 Console, by fjarlq/Matt, CC-BY 2.0]
But think about the lack of switches like the PDP-1's.

Vance is right that it might be a useful historical artifact. But that isn't really a good justification for the effort compared to its effect in pushing back on today's short-termism.

If it were some other organization, I would be pointing out that this is an extraordinarily expensive way to preserve 24 TB of data, but Microsoft can afford it. It is, however, important to observe that the films are only part of this project. Microsoft understands that:
Archiving software across multiple organizations and forms of storage will help ensure its long-term preservation: online archivists call this "LOCKSS," for Lots Of Copies Keeps Stuff Safe.
...
Even in the near future, storing data with multiple partners provides options to people whose access might otherwise be restricted. If GitHub were to become unavailable in any location, for example due to an internet routing issue, those affected could access public code for their projects using the Internet Archive and Software Heritage Foundation.
They are using a range of technologies, making feeds available over the Internet, and partnering with the Internet Archive, the Software Heritage Foundation and the Bodleian Library. These are mostly things which will get used in the foreseeable future, and should be applauded for that reason.

PS: The correct English is LOCKSS standing for Lots Of Copies Keep Stuff Safe. Not "Keeps": I should know. And it is a trademark of Stanford University.

2 comments:

Dragan Espenschied said...

If I had one Euro for each digital preservation project that shoots data into oblivion—to the moon, into abandoned mine shafts, under the eternal ice—I could already buy a hard drive.

David. said...

Ooops - I forgot to provide our plucky band of survivors with some way to talk to the Raspberry Pi, so they would be stuck after they dragged the films out into the sunlight. Somehow a display, a keyboard and some means to power them must have survived the CME. Or maybe it was a very large coffee can with a Raspberry Pi laptop inside.