Anderson et al. write:
FAWN couples low-power embedded CPUs to small amounts of local flash storage, and balances computation and I/O capabilities to enable efficient, massively parallel access to data.

By "low-power embedded CPUs" they meant the kind of CPUs that powered 2009 cellphones such as the iPhone 3GS, which were truly wimpy by today's standards. Both the low-power CPUs and the use of flash rather than hard disk contributed to the massive reduction in energy per query.
Our evaluation demonstrates that FAWN clusters can handle roughly 350 key-value queries per Joule of energy--two orders of magnitude more than a disk-based system.
Although in 2011 energy use was not a big worry for long-term digital preservation, FAWN started me wondering whether a similar hardware architecture could be cost-effective for the continual verification and repair digital preservation requires:
Ian Adams and Ethan Miller of UC Santa Cruz's Storage Systems Research Center and I have looked at this possibility more closely in a Technical Report entitled Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. We show that it is indeed plausible that, even at current flash memory prices, the total cost of ownership over the long term of a storage system built from very low-power system-on-chip technology and flash memory would be competitive with disk.

We continued discussing and writing about the idea until 2016's The Future Of Storage.
A second thread started with a talk at 2013's Chaos Communication Congress by the amazing Bunnie Huang and his colleague xobs entitled "On Hacking MicroSD Cards". We tended then to think of storage media, especially the tiny MicroSD cards, as "devices". But they are actually embedded computers with CPU, memory, storage and I/O; Bunnie and xobs made this clear by replacing their "firmware".
Then in 2015 came the revelation that for many years the NSA's "Equation Group" had been exploiting the fact that storage "devices" were actually embedded computers by installing malware in the computers inside hard disks. Dan Goodin's How “omnipotent” hackers tied to NSA hid for 14 years—and were found at last had the details:
One of the Equation Group's malware platforms, for instance, rewrote the hard-drive firmware of infected computers—a never-before-seen engineering marvel that worked on 12 drive categories from manufacturers including Western Digital, Maxtor, Samsung, IBM, Micron, Toshiba, and Seagate. The malicious firmware created a secret storage vault that survived military-grade disk wiping and reformatting, making sensitive data stolen from victims available even after reformatting the drive and reinstalling the operating system. The firmware also provided programming interfaces that other code in Equation Group's sprawling malware library could access. Once a hard drive was compromised, the infection was impossible to detect or remove.

I noted:
In 2013 Seagate announced Kinetic, an object-based instead of block-based API for drives, intended to support drives with Ethernet connectivity. Not to be outdone, seven months later Western Digital announced Ethernet-connected drives running Linux.
None of these efforts made much progress. The reason was the same as for the later introduction of shingled magnetic recording. Both required major restructuring of the O/S support for disks, and changing the hardware is much easier than changing the software.
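The scale of that restructuring is easy to see by contrasting the two contracts. With a block API the operating system's file system owns all naming and layout; with a Kinetic-style object API the drive does, leaving the kernel's block layer with nothing to manage. The sketch below is purely illustrative — the real Kinetic API spoke protobuf messages over Ethernet, and all names here are invented:

```python
# Illustrative contrast between a block-device interface and a
# Kinetic-style object (key-value) interface. Hypothetical sketch:
# the real Kinetic protocol used protobuf over Ethernet.

class BlockDevice:
    """The traditional contract: the OS addresses fixed-size blocks,
    and the file system maps names to block runs itself."""
    BLOCK_SIZE = 4096

    def __init__(self, nblocks):
        self.blocks = [bytes(self.BLOCK_SIZE)] * nblocks

    def read(self, lba):
        return self.blocks[lba]

    def write(self, lba, data):
        assert len(data) == self.BLOCK_SIZE
        self.blocks[lba] = data

class ObjectDrive:
    """A Kinetic-style contract: the drive maps keys to values
    internally, so the block layer and much of the file system
    have nothing left to do -- exactly the O/S restructuring
    that stalled adoption."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

drive = ObjectDrive()
drive.put(b"simulation/particle/42", b"state vector bytes")
assert drive.get(b"simulation/particle/42") == b"state vector bytes"
```

The point of the contrast: swapping the lower class for the upper one invalidates every layer of the storage stack that assumes block addressing, which is why the hardware shipped but the software never followed.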
Now, around a decade later, comes the joint work of Seagate and Los Alamos. Mann describes the application:
of the trillions of particles generated in an HPC simulation, researchers may be interested in the behavior of just a few hundred or thousand.

Los Alamos has been working on moving computation to the devices for a while:
“You are not necessarily looking for a needle in a haystack, but you’re looking for something and it is usually a small subset of the data,” Gary Grider, deputy division leader at Los Alamos National Laboratory, tells The Next Platform. And while this might not be a big problem on smaller datasets, it can be particularly challenging at the scales LANL is accustomed to. “We might run a job that might be a petabyte of DRAM and it might write out a petabyte every few minutes,” Grider emphasized. And do that for six months.
Los Alamos researchers have already had some success in this regard. Working with SK Hynix, they were able to prove the concept by shifting the reduction function onto the drive’s controller, achieving multiple orders of magnitude improvement in performance in the process.

Now they want to scale up:
“We’ve shown that when we can actually do analytics – simple analytics like reductions – at the full rate that the disk drive can pull the data off the disk, and what that means is there’s no cost to it from a bandwidth point of view,” Grider says.
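Grider's point — that a reduction done at the device runs at the full rate the drive can pull data off the platter, at no bandwidth cost to the host — amounts to a streaming select: the controller applies a predicate to each record as blocks stream past and forwards only the matches. A minimal sketch of the idea, assuming an invented record layout (this is not Seagate's or SK Hynix's actual firmware interface):

```python
# Hypothetical sketch of an on-drive streaming reduction ("select").
# The controller sees data only as fixed-size blocks; it applies a
# predicate record-by-record as blocks stream off the platter, so it
# needs O(1) memory regardless of how much data it scans.

import struct

RECORD = struct.Struct("<qd")   # (particle_id, energy) per record; invented layout
BLOCK_RECORDS = 512             # records per ~8 KiB block

def make_block(records):
    """Pack (id, energy) tuples into one raw block."""
    return b"".join(RECORD.pack(i, e) for i, e in records)

def on_drive_select(blocks, predicate):
    """Stream blocks, yield only matching records; never buffers the dataset."""
    for block in blocks:
        for off in range(0, len(block), RECORD.size):
            pid, energy = RECORD.unpack_from(block, off)
            if predicate(energy):
                yield pid, energy

# Host-side usage: of many simulated particles, pull back only the
# handful whose energy exceeds a threshold.
blocks = (make_block([(b * BLOCK_RECORDS + i, float(i))
                      for i in range(BLOCK_RECORDS)])
          for b in range(4))
hits = list(on_drive_select(blocks, lambda e: e > 509.0))
# Only 2 of 512 records per block match, so only ~0.4% of the
# scanned bytes ever cross the host interface.
```

Because the predicate touches each record exactly once and keeps no state, the filter can run at whatever rate the heads deliver data — which is why the reduction is "free" from a bandwidth point of view.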
Los Alamos, like many other Department of Energy HPC labs, employs a tiered storage architecture, and so the lab began investigating ways to achieve similar results on their larger disk pool. And to do that, Los Alamos has entered into a collaborative research and development agreement with Seagate.

Mann has the plan:
“It turns out Seagate had already been working on some offloads to devices,” Grider explains. “They have this prototype device that has a processor that is right next to the disk drive.”
LANL opted to modify an existing file system – the Zettabyte File System created by the former Sun Microsystems so long ago specifically for large, resilient pools of disk drives we all lovingly call spinning rust these days. ZFS can also be scaled to multiple nodes using Gluster, which is a clustered file system that was acquired by Red Hat a long time ago. Meanwhile for analytics, engineers worked to adapt the file system for use with Apache’s analytics stack.

That is just a start:
Grider notes there is a lot of work to be done. “It’s going to be a fairly long road to get to the point where this is consumable,” he says. “The next thing that we’ll be doing is to turn this into some sort of object model instead of blocks under files.”

See Seagate's decade-old Kinetic! But in the meantime the approach yields benefits:
“We’re not moving the entire analytics workload to disk drives – largely it’s just a reduction, and some joining and things like that,” Grider says. “But from a simplified point of view, reductions happen at the device level, and then joins and sorts and things like that typically happen somewhere in flash or memory.”

But it also reveals problems:
The limiting factor, however, isn’t processing power so much as the tiny amount of memory built into each disk. “It doesn’t have enough memory to do a sort, it can only do a select,” Grider said of the individual drives. “The question is more how do you put enough memory on there to do something more than fairly simple things.”

I see two problems here:
- The reason data centers, including HPC, have most of their data on hard drives is the low $/GB. Drives with the extra memory LANL wants would be a more expensive niche product. I'm skeptical that HPC is a big enough market to support such a niche product.
- There are really strong limits on what drives that only know about data as blocks, not as files or objects, can do. We haven't been successful so far at restructuring the storage stack in operating systems to up-level the device interface.