Tuesday, January 15, 2013

Moving vs. Copying

At the suggestion of my long-time friend Frankie, I've been reading Trillions, a book by Peter Lucas, Joe Ballay and Mickey McManus. They are principals of MAYA Design, a design firm that emerged from the Design and CS schools at Carnegie-Mellon in 1989. Among its founders was Jim Morris, who ran the Andrew Project at C-MU on which I worked from 1983-85. The ideas in the book draw not just from the Andrew Project's vision of a networked campus with a single, uniform file name-space, as partially implemented in the Andrew File System, but also from Mark Weiser's vision of ubiquitous computing at Xerox PARC. Mark's 1991 Scientific American article "The Computer of the 21st Century" introduced the concept to the general public, and although the authors cite it, they seem strangely unaware of work going on at PARC and elsewhere for at least the last 6 years to implement the infrastructure that would make their ideas achievable. Follow me below the fold for the details.

The book's argument is long and complex, but it starts with a simple observation:
There are now more computers in the world than there are people. Lots more. In fact there are now more computers, ... manufactured each year than there are living people. ... Accurate production numbers are hard to come by, but a reasonable estimate is ten billion processors a year. ... And that means that a near-future world containing trillions of computers is simply a done-deal. ... We are building the trillion-node network, not because we can but because it makes economic sense. (pg 25-28)
They argue that our current techniques for designing and building everything from human-computer interaction through applications to networks themselves are inadequate:
We won't just have trillions of computers, we will have a trillion-node network. Done deal. The unanswered question is how, and how well, we will make it work. (pg 32)
The basic change they identify is:
The proliferation of devices has had the effect of bringing about a gradual but pervasive change of perspective: The data are no longer in the computers. We have come to see that the computers are in the data. In essence, the idea of computing is being turned inside out. (pg 36)
At the human interface level their vision is nicely summed up by William Gibson in Distrust That Particular Flavor:
I imagine that one of the things our great-grandchildren will find quaintest about us is how we had all these different function-specific devices. Their fridges will remind them of appointments and the trunks of their cars will, if need be, keep the groceries from thawing. The environment itself will be smart, rather than various function-specific nodes scattered through it. Genuinely ubiquitous computing spreads like warm Vaseline. Genuinely evolved interfaces are transparent, so transparent as to be invisible. (pg 84)
It's a compelling if now somewhat trite vision, and the book has much worthwhile to say on design methodologies for bringing it to reality. But I'm much more interested in the infrastructure that they envisage to implement it. Their trillion-node network has "three basic facets" (pg 42):
  • fungible devices by which they mean a modular architecture that allows devices, even low-level devices, to inter-operate by exchanging messages over the network so that they can participate in "an ecology of information devices" (pg 48) as opposed to being trapped in a walled garden.
  • liquid information by which they mean (pg 53) "the ability of information to flow freely where and when it is needed". They use the analogy of shipping containers, which are moved in various ways none of which depend on their contents, to suggest a Little Box Transport Protocol (LBTP) that would subsume all the special-purpose protocols such as HTTP, SMTP, POP, IMAP, NTP and so on:
    Such a change would ... vastly increase information liquidity [because] ... both data transport and data storage would become much more standardized and general. ... This becomes increasingly important in the age of Trillions, since it permits relatively simple devices to retain generality. (pg 57)
    It would also encourage appropriate structuring of information:
    since the boxes are assumed to be relatively small, designers will be forced to break up large blocks of data ... into smaller units. ... competent engineers ... will tend to make their slices at semantic boundaries. (pg 57)
  • a genuine cyberspace by which they mean a public, shared information space, implemented by:
    Each box has a unique number on its outside. The numbers are called universally unique identifiers (UUIDs). The number identifies the container, regardless of where it is located and regardless of whether the contents of the box change. ... We say public because the little boxes are capable of maintaining their identity independently of where they are stored or who owns them. (pg 63)
    In other words, every information unit in this space has an analog of a Uniform Resource Identifier (URI) but it is not a Uniform Resource Locator (URL).
The problem with URLs such as http://www.example.com/index.html that populate the Web is that they tell you not just how to access the information resource but, more importantly, where to access it: at the server whose DNS name is typically the element after the "scheme name" such as http; in the example it is www.example.com. The URL says that to get the content of the resource you send www.example.com an HTTP request for index.html. In the authors' vision, by contrast, the URI for the resource would look like lbtp:6B860FBD18AA, and to get the content of the resource you would issue an LBTP request for 6B860FBD18AA. You don't specify where to send the request; you ask the network, and it's the network's job to locate (a copy of) the resource and make you a local copy of it.
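The difference is visible as soon as the two names are parsed. Here is a minimal sketch using Python's standard urllib.parse; the lbtp scheme is the book's hypothetical, and nothing here implements it, the parser simply treats it as an opaque scheme followed by an identifier.

```python
from urllib.parse import urlparse

# A Web URL: the name itself embeds the host that must answer the request.
url = urlparse("http://www.example.com/index.html")
print(url.scheme, url.hostname, url.path)

# The book's hypothetical LBTP URI: a scheme and an identifier, but no host.
uri = urlparse("lbtp:6B860FBD18AA")
print(uri.scheme, uri.hostname, uri.path)
```

The second name has no hostname component at all, so there is nothing in it that can go stale when the content moves.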

Naming information resources by pointing to their location creates many fundamental problems. In particular, it means that an archived copy of a resource cannot replace the original if it becomes lost, because its name is different. There is much to like in the authors' vision, but from my perspective one of its more important features is that, unlike the Web, this would be an archive-friendly environment.

In the Web, all the URLs you coin for resources you create include a DNS name. It is either a name in a domain you own, or in someone else's domain. In the first case, if you stop paying your domain registrar all the URLs you coined and all the links that use them will break instantly. In the second case, the owner of the domain can break all your links at a whim.

Notice that in the authors' vision there is no DNS name in the URI; it isn't a URL. Once coined, the URI is never reused. The content of the box it identifies can be changed or deleted, but the URI is permanent. A request for the URI may fail, if the network is unable to find any copy of the box, but the URI can never be reassigned to another box. That's what "universally unique" means.

So, the vision is that the network consists of a vast cloud of boxes each with an identifier and some data. A node can do three things in this network:
  • it can request the network to create a local copy of a box,
  • it can satisfy a request from another node for a box of which it has a local copy,
  • and it can create a box with an identifier and some data.
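These three operations can be sketched in a few lines of Python. This is a toy under stated assumptions, not anything from the book or from CCN: the Node class, its method names, and the peers list (standing in for whatever machinery the real network would use to locate copies) are all hypothetical.

```python
import uuid


class Node:
    """Hypothetical node; 'peers' stands in for the network's lookup machinery."""

    def __init__(self):
        self.boxes = {}   # identifier -> data: the local copies this node holds
        self.peers = []   # other nodes this node is able to ask

    def create(self, data):
        """Create a box with a fresh identifier and some data."""
        box_id = uuid.uuid4().hex
        self.boxes[box_id] = data
        return box_id

    def satisfy(self, box_id):
        """Answer another node's request from a local copy, or return None."""
        return self.boxes.get(box_id)

    def request(self, box_id):
        """Obtain a local copy from whichever peer happens to hold one."""
        if box_id in self.boxes:
            return self.boxes[box_id]
        for peer in self.peers:
            data = peer.satisfy(box_id)
            if data is not None:
                self.boxes[box_id] = data   # a copy: the peer keeps its own
                return data
        return None   # the request fails, but the identifier stays valid


origin, archive, reader = Node(), Node(), Node()
archive.peers = [origin]
reader.peers = [origin, archive]

box_id = origin.create(b"hello")
archive.request(box_id)         # the archive now holds its own copy
origin.boxes.clear()            # the original is lost
print(reader.request(box_id))   # served by the archive under the same name
```

Because the name carries no location, the reader never learns or cares that the archive rather than the origin answered, which is the archive-friendly property noted above.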
The language is different, but this is the same underlying model of how a network should work that inspired the Content-Centric Network (CCN) project at PARC. Van Jacobson and a team have been working since at least 2006, when he gave a Tech Talk at Google, to implement a radically different network stack that works in just this way. The introductory paper is here (PDF) and some readable slides here (PDF). Although the CCN team don't talk about their ideas this way, in my mind the radical change they make is this:
  • The goal of IP and the layers above is to move data. There is an assumption that, in the normal case, the bits vanish from the sender once they have been transported, and also from any intervening nodes.
  • The goal of CCN is to copy data. A successful CCN request creates a locally accessible copy of some content. It says nothing about whether in the process other (cached) copies are created, or whether the content is deleted at the source. None of that is any concern of the CCN node making the request; these are configuration details of the underlying network.
While it has its copy, the CCN node can satisfy requests from other nodes for that content; it is a peer-to-peer network. Basing networking on the copy-ability of bits rather than the transmissibility of bits is a huge simplification.

There is one big difference between the language that Trillions uses and the more careful language that CCN uses. Trillions' "little boxes" carry the implication that they are containers for information. CCN is careful not to talk about information containers, because the idea that information is contained in something implies that the information is somewhere, namely it is where the box is. This is a basic problem with the "little box" analogy. Trillions doesn't actually make this mistake. The authors understand that there are many copies of the information identified by the URI of the box, and that Lots Of Copies Keep Stuff Safe:
move to a peer-to-peer world where every machine could be a client and a server at the same time, and participate in a global scheme of data integrity via massive, cooperative replication. (pg 105)
CCN uses the word "collection" for the unit of information identified by a URI, and this carries no implication that the information is anywhere in particular.

The advent of the Web has meant that the user model of the Internet has changed from Senator Stevens' "a series of tubes" that allows computers to send each other data, to a vast store of data that your computer can access. The problem is that the current network stack forces parts of the "tubes" model to stick out through the user interface. CCN makes the way the network stack works match the user model.

Because the goal of CCN is to create copies of content wherever they are needed, system designers can no longer kid themselves that they can protect information by preventing the bits being copied. Anyone can get the bits simply by asking for them; that is the whole point of CCN. Even now, the only way to protect information is to encrypt it; futile attempts to prevent it being copied are obviously failing. CCN uses sophisticated encryption techniques, both to prevent unauthorized access to the content, and to prevent the kinds of attacks that involve altering the URIs of the little boxes.
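The defence against altered names rests on authenticating the binding between a name and its content, so that neither can be swapped independently. The sketch below illustrates only that idea and is not CCN's actual scheme: it assumes a shared-key HMAC from Python's standard library as a stand-in for a publisher's public-key signature, and SECRET, seal and verify are hypothetical names.

```python
import hashlib
import hmac

SECRET = b"publisher-key"   # hypothetical shared key, for illustration only


def seal(name: str, content: bytes) -> bytes:
    """Compute a tag over both name and content, binding them together."""
    encoded = name.encode()
    # Length-prefix the name so the name/content boundary is unambiguous.
    msg = len(encoded).to_bytes(4, "big") + encoded + content
    return hmac.new(SECRET, msg, hashlib.sha256).digest()


def verify(name: str, content: bytes, tag: bytes) -> bool:
    """Check that this exact name was sealed together with this exact content."""
    return hmac.compare_digest(seal(name, content), tag)


tag = seal("lbtp:6B860FBD18AA", b"original bits")
print(verify("lbtp:6B860FBD18AA", b"original bits", tag))   # name and bits intact
print(verify("lbtp:6B860FBD18AA", b"tampered bits", tag))   # content altered
print(verify("lbtp:000000000000", b"original bits", tag))   # name altered
```

A real deployment would need tags that anyone can check without holding the publisher's secret, which is what public-key signatures provide and a shared-key HMAC does not.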

Van Jacobson and the ideas from CCN are now part of the Named Data Networking (NDN) project, a much larger effort involving teams from 11 US Universities funded by the NSF in 2010 as one of four Future Internet Architecture projects.

The chief failing of Trillions is its naive optimism about the business and technical problems of not just making their vision work, but of getting it to displace the current infrastructure. The technical optimism is evident both in their ideas about storage management and costs:
Exactly which information ends up being stored on which disk drive is controlled by no one - not even the owner of the disk drive. You don't get to decide which radio waves travel through your house on their way to your neighbor's set - your airspace is a public resource. In this vision, the extra space on your disk drive becomes the same kind of thing. This might be controversial if all that extra space wasn't essentially free - but it is, and is all but guaranteed to become more so as Moore's law ticks away. (pg 106)
and in the contrast between the hand-waving "little boxes" and CCN's detailed description of solutions to important problems such as routing, flow control, naming and security:
We will eschew the complexity of the Internet at the lowest levels of our infrastructure with good riddance. (pg 299)

There have been projects seduced by the lure of all that "free" unused storage sitting on people's desktops for most of the history of desktop computing. None of them have taken the world by storm. This might have something to do with the huge nexus between industry and government devoted to finding out whether you have paid for the information on your hard disk and suing you if they don't like what they find. And now desktop computers with huge 3.5" disks, and even laptops with fairly large 2.5" disks, are fading into history.

Their business optimism is evident in their blithe discussion of the business and legal obstacles to deployment:
Getting past the client-server model will be a long slow slog. Technological inertia and backward-looking economic interests will conspire to ensure this. But slowly the barriers to progress will fall. (pg 307)
They suggest some economic advantages that would accrue if their vision were implemented, but really don't discuss how the trillion-dollar companies enabled by the current model would be prevented from strangling it at birth. This is all the more of a problem since the advantage of their vision is precisely that it avoids the kind of centralization that would allow an alternative group of equally powerful companies to emerge.

Deployment is a problem for CCN and NDN too, but they are at least attempting to address it in an incremental way that involves researchers at telecom companies. The CCN implementation is free and open source, and runs on top of IP; IP can run on top of CCN. As Van Jacobson points out (PDF):
IP started as an overlay on the phone system; today the phone system is an overlay on IP.
But a path by which CCN could displace TCP, UDP and the forest of protocols that run on them is still not clear.

Trillions is right to call attention to the fundamental problems of the current network stack. But for a book doing so in 2012 not to refer to a major research effort to address them that started at least 6 years earlier and is now one of four being funded by the NSF seems odd. And the critical flaw in the book is naive faith that the market will ensure that the economic benefits of the new architecture will win out in the end over the determined opposition of the incumbents. This is all too reminiscent of the techno-optimism of the Internet's early days that has led to a duopoly owning the Internet in the United States, and charging the customers the developed world's highest prices for the developed world's worst broadband.

2 comments:

Andy Jackson said...

Nice summary. Reminded me of Magnet URIs - URIs for content that are already in use and supported by many BitTorrent-style P2P network clients.

David. said...

Martin Geddes has a well-argued review of the end-to-end argument that provides a different route to the same end-point as CCN.