For a long time, companies like Amazon, Google, Facebook and institutions like the Internet Archive have been building large-scale, long-term storage by assembling racks of servers full of disks connected via IP networks. As these systems scaled up, the major concern became the cost of keeping all this stuff. The primary cost drivers were the capital cost of the hardware and the power and cooling it required.
Back in 2009, a team from C-MU published FAWN: A Fast Array of Wimpy Nodes, showing that the power and cooling of storage could be radically reduced without sacrificing performance by, in effect, eliminating the servers and getting IP connectivity right to the storage media. In their case the media were a combination of a very low-power processor and a fairly small amount of flash. The media were accessed over the IP network via an API that exposed a key-value object store, which is very similar to the storage API in state-of-the-art storage systems such as OpenStack Swift. FAWN's focus was on performance; I explained where the performance was coming from in this 2010 blog post. In 2011 Ian Adams, Ethan Miller and I published DAWN: A Durable Array of Wimpy Nodes, showing that even if your focus was on long-term archival storage FAWN-like architectures made sense.
In recent years Dave Anderson of Seagate has several times pointed out that state-of-the-art disk drives actually have quite powerful, low-power CPUs on-board to run the disk firmware, which these days is a very substantial body of code. These processors are at least as powerful as the processors of the FAWN and DAWN proposals. There was even an attempt, starting in 2004 at ANSI T10, to provide the drives with an object storage command set. What the drives didn't have was IP networking.
Last week Seagate changed all that. They announced:
- An API implemented in Java, Python and Erlang for accessing media with IP connectivity and key-value storage semantics.
- A simulator for these media so you can experiment with the API.
- A 3.5" 5900RPM 4TB drive, available next year, that has 2x1Gb Ethernet connectivity as described by Chris Mellor at The Register. The drive appears to re-purpose the SAS connector to carry the Ethernet.
The API stores each object as an:
Entry(byte[] key, byte[] value, EntryMetadata metadata)
and the metadata as an:
EntryMetadata(byte[] version, byte[] tag, String algorithm)
They are stored by synchronous and asynchronous Put operations and retrieved by synchronous and asynchronous Get operations specifying the key. Objects can also be retrieved by the next and previous keys to the argument key. The API also provides access to the object's metadata and to a list of the keys within a specified range. You can get the software, including the simulator, and the documentation by:
git clone https://github.com/Seagate/Kinetic-Preview.git
SwiftStack also announced a preview of OpenStack Swift running on Seagate's simulator:
Having the storage device speak in keys and values means that there is less impedance mismatch between Swift (which is native object) and traditional drives (which speak blocks). This results in more efficiency and utilization.
What we have done at SwiftStack is incorporate the Kinetic drives into the OpenStack Swift storage system as open source. Then we use pools of computing resources to provide the full Swift services in the cluster (auditing, replication, request routing authentication, etc.).
What makes the Kinetic announcement especially interesting is that by up-levelling the media interface, it hides the details of the underlying media. The application doesn't know anything about how the objects are stored and how they are indexed.
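To make the shape of the key-value interface concrete, here is a minimal in-memory sketch in Python of the operations described above: put, get, next and previous key, and key range. It is not Seagate's library. The class name ToyKeyValueDrive and all method names are hypothetical, only the synchronous operations are modelled, and the Entry and EntryMetadata fields simply mirror the signatures quoted above.

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EntryMetadata:
    version: bytes
    tag: bytes
    algorithm: str


@dataclass(frozen=True)
class Entry:
    key: bytes
    value: bytes
    metadata: EntryMetadata


class ToyKeyValueDrive:
    """Hypothetical in-memory stand-in for a key-value drive (not the Kinetic API)."""

    def __init__(self) -> None:
        self._keys: List[bytes] = []            # keys kept in sorted order
        self._entries: Dict[bytes, Entry] = {}  # key -> stored entry

    def put(self, entry: Entry) -> None:
        # Insert the key in sorted position the first time it is seen.
        if entry.key not in self._entries:
            bisect.insort(self._keys, entry.key)
        self._entries[entry.key] = entry

    def get(self, key: bytes) -> Optional[Entry]:
        return self._entries.get(key)

    def get_next(self, key: bytes) -> Optional[Entry]:
        # First stored key strictly greater than the argument key.
        i = bisect.bisect_right(self._keys, key)
        return self._entries[self._keys[i]] if i < len(self._keys) else None

    def get_previous(self, key: bytes) -> Optional[Entry]:
        # Last stored key strictly less than the argument key.
        i = bisect.bisect_left(self._keys, key)
        return self._entries[self._keys[i - 1]] if i > 0 else None

    def get_key_range(self, start: bytes, end: bytes) -> List[bytes]:
        # All stored keys k with start <= k <= end.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return self._keys[lo:hi]


# Usage: store three objects out of order, then navigate by key.
drive = ToyKeyValueDrive()
meta = EntryMetadata(version=b"1", tag=b"", algorithm="SHA1")
for k in (b"a", b"c", b"b"):
    drive.put(Entry(k, b"value-" + k, meta))
```

The point of the sketch is that nothing in the caller's view mentions blocks, sectors or files; the ordering and placement of objects is entirely the store's business.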
I am way overdue for a blog post on developments in storage technology; I'm ashamed to admit that the last was a year ago. One thing I haven't commented on is that, in September, Seagate announced that the long-touted advent of Shingled Magnetic Recording technology had finally happened, and that they had shipped over 1M shingled drives. Tom Coughlin expects 5TB 3.5" shingled drives next year. One reason why shingling has been slow to take off is that it turns the hard disk from a random-access into an append-only medium, and this has major impacts up the software stack from the disk driver via the file system all the way to the application. The importance of Kinetic in this context is obvious; it avoids the need to change all this code by allowing a very large class of applications to talk directly to the storage media, using the TCP/IP stack instead of the file system and the disk drivers. Because in the Kinetic framework the drive, not the operating system software, manages the physical location of the data, the difference in the location policies caused by the change from conventional to shingled drives is invisible to the application.
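A small sketch may help show why an append-only medium can sit behind a key-value interface without the application noticing. The Python below is purely illustrative and says nothing about how Seagate's firmware actually works: AppendOnlyStore is a hypothetical class that never overwrites data in place, and instead appends every new value to a log and keeps an index from key to the latest location.

```python
from typing import Dict, Optional, Tuple


class AppendOnlyStore:
    """Hypothetical key-value store over an append-only log (illustration only)."""

    def __init__(self) -> None:
        self._log = bytearray()                       # stands in for the medium
        self._index: Dict[bytes, Tuple[int, int]] = {}  # key -> (offset, length)

    def put(self, key: bytes, value: bytes) -> None:
        # New data always goes at the end of the log; old data is never rewritten.
        offset = len(self._log)
        self._log += value
        self._index[key] = (offset, len(value))

    def get(self, key: bytes) -> Optional[bytes]:
        location = self._index.get(key)
        if location is None:
            return None
        offset, length = location
        return bytes(self._log[offset:offset + length])

    def log_size(self) -> int:
        return len(self._log)


# Usage: an "overwrite" is just another append plus an index update.
store = AppendOnlyStore()
store.put(b"k", b"old")
store.put(b"k", b"newer")
latest = store.get(b"k")
```

The caller sees ordinary put and get semantics, while the superseded value still occupies space in the log until something reclaims it. That reclamation policy is exactly the kind of location decision that moves from the operating system into the drive.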
Based on what I've been able to find on the Web, I still have some questions:
- Will the wire protocol the media implement be an open standard? As far as I can see Seagate hasn't released this wire protocol, which is not a good sign.
- The software they have released so far is not Open Source, which is not a good sign. Without the wire protocol spec, an alternate API implementation would depend on reverse-engineering. What is Seagate's licensing policy for the API implementation?
- Seagate claims that the technology supports end-to-end integrity, so presumably the wire protocol does. But their API doesn't, which is not a good sign. How do applications use end-to-end integrity?
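For readers unfamiliar with the term, one common reading of end-to-end integrity is that the writer attaches a digest to each object and the reader recomputes it, so that corruption anywhere along the path is detected. The Python sketch below illustrates only that general idea. The function names put_with_tag and get_verified are hypothetical, a plain dict stands in for the medium, and the choice of SHA-1 is an assumption; none of this is taken from the Kinetic API.

```python
import hashlib
from typing import Dict, Tuple

Medium = Dict[bytes, Tuple[bytes, bytes]]  # key -> (value, tag)


def put_with_tag(medium: Medium, key: bytes, value: bytes) -> None:
    # The application computes the digest before the data leaves its hands.
    tag = hashlib.sha1(value).digest()
    medium[key] = (value, tag)


def get_verified(medium: Medium, key: bytes) -> bytes:
    # The application recomputes the digest on what came back and compares.
    value, tag = medium[key]
    if hashlib.sha1(value).digest() != tag:
        raise ValueError("integrity check failed for key %r" % key)
    return value


# Usage: a clean round trip, then simulated corruption on the medium.
medium: Medium = {}
put_with_tag(medium, b"k", b"hello")
clean = get_verified(medium, b"k")

value, tag = medium[b"k"]
medium[b"k"] = (b"hellp", tag)  # flip the stored data, keep the old tag

detected = False
try:
    get_verified(medium, b"k")
except ValueError:
    detected = True
```

Because the digest is computed and checked by the application rather than by anything in between, a fault in the network, the drive or the medium shows up as a mismatch at read time.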
This is a very interesting move in the storage space. Unfortunately, Jeff Darcy pointed out some significant problems with this particular implementation:
1) The wire protocol will be released in the open. It is our goal to make this available to the community.
2) The Library software will be released open source.
(I made these statements publicly at the Basho Ricon-West conference yesterday).
3) Sorry, the API does have the ability to put a Tag field in the "EntryMetaData". This field is both preserved and checked by the drive on a regular basis. This is documented in the javadoc.
James, thanks for the corrections. Kinetic seems to suffer from some poor choices of name. It wasn't clear to me from the Javadoc that "Tag" meant integrity token, and I agree with Jeff Darcy when he says "That's fine for a key/value store or a deep archival store (the very opposite of "kinetic" BTW)" but I disagree with most of the rest of his post since he seems not to realize that the target market for this technology is precisely "a key/value store or a deep archival store".
One other issue I forgot to raise is the question of the drive connector. The storage media market has a long history of great benefits from non-proprietary media connectors. I assume that the re-purposing of the SAS connector is not encumbered with IP protection. But I for one would feel a lot better about adopting this technology for long-term storage if there were a second source for the media. And there's only one viable second source for Seagate's hard drives.
For the drive connector: I know of no IP restricting the connector to be used only for SAS.
As for a second source: We are fully aware that many customers want a second source and we hope that one comes along (which is another comment I made at the Basho Ricon meeting).