Anderson et al write:
FAWN couples low-power embedded CPUs to small amounts of local flash storage, and balances computation and I/O capabilities to enable efficient, massively parallel access to data.
...
Our evaluation demonstrates that FAWN clusters can handle roughly 350 key-value queries per Joule of energy--two orders of magnitude more than a disk-based system.

By "low-power embedded CPUs" they meant the kind of CPUs that powered 2009 cellphones such as the iPhone 3GS, which were truly wimpy by today's standards. Both the low-power CPUs and the use of flash rather than hard disk contributed to the massive reduction in energy per query.
Although in 2011 energy use was not a big worry for long-term digital preservation, FAWN started me wondering whether a similar hardware architecture could be cost-effective for the continual verification and repair digital preservation requires:
Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I have looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We show that it is indeed plausible that, even at current flash memory prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk.

We continued discussing and writing about the idea until 2016's The Future Of Storage.
A second thread started with a talk at 2013's Chaos Communication Congress by the amazing Bunnie Huang and his colleague xobs entitled "On Hacking MicroSD Cards". We tended then to think of storage media, especially the tiny MicroSD cards, as "devices". But they are actually embedded computers with CPU, memory, storage and I/O; Bunnie and xobs made this clear by replacing their "firmware".
Then in 2015 came the revelation that for many years the NSA's "Equation Group" had been exploiting the fact that storage "devices" were actually embedded computers by installing malware in the computers inside hard disks. Dan Goodin's How “omnipotent” hackers tied to NSA hid for 14 years—and were found at last had the details:
One of the Equation Group's malware platforms, for instance, rewrote the hard-drive firmware of infected computers—a never-before-seen engineering marvel that worked on 12 drive categories from manufacturers including Western Digital, Maxtor, Samsung, IBM, Micron, Toshiba, and Seagate.

The malicious firmware created a secret storage vault that survived military-grade disk wiping and reformatting, making sensitive data stolen from victims available even after reformatting the drive and reinstalling the operating system. The firmware also provided programming interfaces that other code in Equation Group's sprawling malware library could access. Once a hard drive was compromised, the infection was impossible to detect or remove.

I noted:

This early supply chain attack led drive manufacturers to secure their firmware update mechanism.

A third thread involved up-leveling the interface to hard drives. In 2013 Seagate announced Kinetic, an object-based instead of block-based API for drives, intended to support drives with Ethernet connectivity. Not to be outdone, seven months later Western Digital announced Ethernet-connected drives running Linux.
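To make "up-leveling the interface" concrete: a block drive exposes nothing but numbered, fixed-size sectors, while a Kinetic-style drive exposes a key-value store that the drive itself manages. Below is a minimal sketch of the contrast in Python; the class and method names are mine for illustration, not Seagate's actual Kinetic API, which was a key/value protocol spoken to the drive over Ethernet.

```python
# Illustrative only: hypothetical interfaces, not the real Kinetic protocol.

class BlockDrive:
    """Traditional block interface: the drive knows nothing about files or
    objects, only fixed-size numbered sectors."""
    SECTOR_SIZE = 4096

    def read(self, lba: int, count: int) -> bytes:
        """Return `count` sectors starting at logical block address `lba`."""
        ...

    def write(self, lba: int, data: bytes) -> None:
        """Overwrite sectors starting at `lba`; the caller supplies whole sectors."""
        ...


class ObjectDrive:
    """Kinetic-style object interface: the drive itself maps keys to values,
    so the host no longer needs a local file system to locate the data."""

    def put(self, key: bytes, value: bytes) -> None:
        """Store `value` under `key`; the drive decides sector placement."""
        ...

    def get(self, key: bytes) -> bytes:
        """Return the value stored under `key`."""
        ...

    def delete(self, key: bytes) -> None:
        """Remove `key` and reclaim its space inside the drive."""
        ...

    def get_key_range(self, start: bytes, end: bytes) -> list[bytes]:
        """Return the keys between `start` and `end` -- a query the drive can
        only answer because it, not the host, owns the namespace."""
        ...
```

The point of the second interface is that a question like "which keys are in this range?" can be answered inside the drive, which is exactly the kind of capability the LANL work below depends on.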
None of these efforts made much progress. The reason was the same as for the later introduction of shingled magnetic recording. Both required major restructuring of the O/S support for disks, and changing the hardware is much easier than changing the software.
Now, around a decade later, comes the joint work of Seagate and Los Alamos. Mann describes the application:
of the trillions of particles generated in an HPC simulation, researchers may be interested in the behavior of just a few hundred or thousand.

Los Alamos has been working on moving computation to the devices for a while:
“You are not necessarily looking for a needle in a haystack, but you’re looking for something and it is usually a small subset of the data,” Gary Grider, deputy division leader at Los Alamos National Laboratory, tells The Next Platform. And while this might not be a big problem on smaller datasets, it can be particularly challenging at the scales LANL is accustomed to. “We might run a job that might be a petabyte of DRAM and it might write out a petabyte every few minutes,” Grider emphasized. And do that for six months.
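A back-of-the-envelope calculation shows the scale Grider is describing. Taking "every few minutes" to mean every three minutes and "six months" to mean 180 days (my assumptions, purely for illustration):

```python
# Hedged back-of-envelope: the three-minute interval and 180-day run are my
# assumptions; Grider only says "every few minutes" and "six months".
PB = 10**15                       # bytes in a petabyte
checkpoint_bytes = 1 * PB         # "a petabyte every few minutes"
interval_minutes = 3              # assumed meaning of "every few minutes"
run_minutes = 180 * 24 * 60       # six ~30-day months

checkpoints = run_minutes // interval_minutes
total_bytes = checkpoints * checkpoint_bytes
print(f"{checkpoints:,} checkpoints, ~{total_bytes / 10**18:.0f} exabytes written")
# => 86,400 checkpoints, ~86 exabytes written -- far too much to haul back to
#    the compute nodes just to find a few thousand interesting particles.
```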
Los Alamos researchers have already had some success in this regard. Working with SK Hynix, they were able to prove the concept by shifting the reduction function onto the drive’s controller, achieving multiple orders of magnitude improvement in performance in the process.

Now they want to scale up:
“We’ve shown that when we can actually do analytics – simple analytics like reductions – at the full rate that the disk drive can pull the data off the disk, and what that means is there’s no cost to it from a bandwidth point of view,” Grider says.
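The reason a reduction at the drive has "no cost from a bandwidth point of view" is that the data streaming off the platters never has to cross the drive's interface; only the few matching records do. A minimal sketch of the contrast, with an invented record format:

```python
# Illustrative sketch of pushing a reduction down to the drive; the record
# format and 64-byte record size are invented for the example.

def parse_records(block: bytes, record_size: int = 64):
    """Split a raw block into fixed-size records (format invented here)."""
    for i in range(0, len(block), record_size):
        yield block[i:i + record_size]

def host_side_select(drive_blocks, predicate):
    """Conventional path: every block crosses the drive's interface and the
    host discards almost all of it. Bytes moved = size of the dataset."""
    hits = []
    for block in drive_blocks:               # every block crosses the bus
        for record in parse_records(block):
            if predicate(record):
                hits.append(record)
    return hits

def device_side_select(drive_blocks, predicate):
    """Offloaded path: the same loop runs on the processor sitting next to
    the disk, at the full rate the heads can stream data, and only matching
    records are returned. Bytes moved = size of the result."""
    for block in drive_blocks:               # stays inside the drive
        for record in parse_records(block):
            if predicate(record):
                yield record                 # only the hits cross the interface
```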
Los Alamos, like many other Department of Energy HPC labs, employs a tiered storage architecture, and so the lab began investigating ways to achieve similar results on their larger disk pool. And to do that, Los Alamos has entered into a collaborative research and development agreement with Seagate.

Mann has the plan:
“It turns out Seagate had already been working on some offloads to devices,” Grider explains. “They have this prototype device that has a processor that is right next to the disk drive.”
LANL opted to modify an existing file system – the Zettabyte File System created by the former Sun Microsystems so long ago specifically for large, resilient pools of disk drives we all lovingly call spinning rust these days. ZFS can also be scaled to multiple nodes using Gluster, which is a clustered file system that was acquired by Red Hat a long time ago. Meanwhile for analytics, engineers worked to adapt the file system for use with Apache’s analytics stack.

That is just a start:
Grider notes there is a lot of work to be done. “It’s going to be a fairly long road to get to the point where this is consumable,” he says. “The next thing that we’ll be doing is to turn this into some sort of object model instead of blocks under files.”

See Seagate's decade-old Kinetic! But in the meantime the approach yields benefits:
“We’re not moving the entire analytics workload to disk drives – largely it’s just a reduction, and some joining and things like that,” Grider says. “But from a simplified point of view, reductions happen at the device level, and then joins and sorts and things like that typically happen somewhere in flash or memory.”

But also reveals problems:
The limiting factor, however, isn’t processing power so much as the tiny amount of memory built into each disk. “It doesn’t have enough memory to do a sort, it can only do a select,” Grider said of the individual drives. “The question is more how do you put enough memory on there to do something more than fairly simple things.”

I see two problems here:
- The reason data centers, including HPC, have most of their data on hard drives is the low $/GB. Drives with the extra memory LANL wants would be a more expensive niche product, and I'm skeptical that HPC is a big enough market to support one.
- There are really strong limits on what drives that only know about data as blocks, not as files or objects, can do. We haven't been successful so far at restructuring the storage stack in operating systems to up-level the device interface.
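Grider's select-versus-sort distinction makes the memory problem concrete: a select can stream with a constant-size buffer, whereas a sort must hold (or repeatedly re-read) everything it has seen before it can emit its first result. A sketch, again with invented record handling:

```python
# Illustrative sketch; `records` is any iterator over the records streaming
# off the platters.

def streaming_select(records, predicate):
    """O(1) memory: look at one record, emit or discard it, move on. This is
    the kind of operation a controller with very little DRAM can run at the
    drive's full streaming rate."""
    for record in records:
        if predicate(record):
            yield record

def in_memory_sort(records, key):
    """O(n) memory: every record must be held (or spilled and merged over
    multiple passes) before the first sorted record can be emitted, which is
    why the drive "can only do a select" and the joins and sorts happen
    upstream in flash or host memory."""
    return sorted(records, key=key)
```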
Hi, David. Former Gluster maintainer (now retired) and Kinetic critic here. We might even have met once at FAST or some such. This is an interesting development, all right. Doing a select (or similar) at the device seems like the right level of abstraction to me. "Give me a list of file/object/whatever names matching this pattern" is likely to be cheaper than "give me all and I'll sort them out". Going further than that, especially scanning over the actual data (e.g. for checksums) risks overloading CPU, memory, or even the bus between them. Doing direct device-to-device transfers is another possible win. Rebalancing was one of my main interests both on Gluster and on a different exabyte-level system at FB, and bouncing data through a third node always seemed wasteful. I hope the folks at LANL find more ways to get wins out of this approach.
Hey David,
Glad to see our old work getting a mention again!
I'm actually still working on computational storage/NDP work these days. That said, despite my fervent hope that *something* will stick, I'm not very optimistic for a few reasons, some of which you touch on.
With our last attempts (see our HotStorage paper in 2019) we weren't able to make anybody totally happy, it seemed. It was easy to use (just a simple client library, hand it a file path and an opcode + arguments and off you go), but other folks would insist on full programmability. Full programmability gave some folks the heebie-jeebies + the increased cost on a device (see the collapse of NGD Systems) meant it really wasn't economical, and was more of a curiosity. There also still isn't a "killer app" that has wide enough applicability to really give legs to such complex hardware.
That said, I think some of the concepts we've been developing for computational storage *devices* apply more generically to near-data-processing at the server level for block and object storage, which still buys you something for large scale-out systems. Shameless plug for our HotStorage 2021 paper on dealing with sharded/erasure coded data for near-data-processing.
As an aside, I was at Seagate when they were pushing Kinetic and there were some internal politics at play in its failure, but also generally it occupied an awkward space. As you point out, it required a big change in stacks to use it at scale, but it didn't really offer any meaningful features to make it worthwhile to do such upheaval. It also gave DC operators hives when they found out they had ten zillion new IP addressable devices to deal with. We tried to get them to lean in on some of its more interesting characteristics, but they just didn't give a rip or pushed back with "too hard to do with too little computational + memory resources".