Thursday, February 7, 2019

Cloud For Preservation

Imagine you're responsible for preserving the long-established digital collection at a large research or national library. It is currently preserved in home-grown software, or extensively customized off-the-shelf software, which you are responsible for running on hardware operated by your institution's IT department. You are probably not a large customer of theirs. They are probably laying down the law, saying "cloud first", especially as you are looking at a looming hardware refresh. Below the fold, I examine a set of issues that need to be clarified in the decision-making process.

Executive Summary

To summarize the discussion below, these are the issues that need to be considered when outsourcing long-term digital preservation to the cloud:
  • The effect of egress charges on demand spikes and vendor lock-in.
  • Whether the reliability the service advertises is adequate, and whether it is actually delivered.
  • Whether integrity verification is possible and affordable.
  • The differences in security environment between on-premise and cloud storage.
  • Whether legacy applications can use the API of the cloud storage service.
  • Whether Kryder's Law will continue to deliver rapid decreases in the cost of storage and, if it does, whether they will be reflected in cloud storage service pricing.
  • On-premise vs. cloud storage economics.
  • The terms and conditions in the service's Customer Agreement, including the disclaimers of liability and the jurisdiction for dispute resolution.
Each of these issues is covered in some detail below. They are not a substitute for, but should be considered in the context of, an explicit threat model for the archive. Examples of suitable threat models are those for the LOCKSS system and the CLOCKSS Archive's TRAC audit. Mitigating or responding to some of the plausible threats involves changing cloud providers; planning for this eventuality is important.

Economic Vendor Lock-In

Unlike access, preservation necessarily has a long time horizon. The Digital Preservation Network wanted its members to make a 20-year commitment, but its members didn't. One of the important lessons the DPN team wants the community to learn from its failure is:
While research universities and cultural heritage institutions are innately long-running, they operate on that implicitly rather than by making explicit long-term plans.
Much of the attraction of cloud technology for organizations, especially public institutions funded through a government's annual budget process, is that it transfers costs from capital to operational expenditure. It is easy to believe that this increases financial flexibility. As regards ingest and dissemination, this may be true. Ingesting some items can be delayed to the next budget cycle, or the access rate limit lowered temporarily. But as regards preservation, it isn't true. It is unlikely that parts of the institution's collection can be de-accessioned in a budget crunch, only to be re-accessioned later when funds are adequate. Even were the content still available to be re-ingested, the cost of ingest is a significant fraction of the total life-cycle cost of preserving digital content.

Despite the difficulty of reconciling preservation's long time horizon with an annual budget process, the limited life of on-premise hardware provides regular opportunities to switch to lower-cost vendors. Customer acquisition is a significant cost for most cloud platforms; they need to reduce churn to a minimum. Lock-in tends to be an important aspect of cloud service business models.

To illustrate the effect of lock-in for preservation for three types of services, the tables show the cost of ingesting a Petabyte in a month over the network, then storing it for a year with no access, and then extracting it over the network in a month. The last column is the approximate number of months of storage cost that getting the Petabyte out in a month represents. First, services with associated compute platforms, whose lock-in periods vary:

Compute Platform Services
Service In Store Out Total Lock-in (months)
AMZN S3 $10,825 $258,600 $63,769 $333,194 3.0
GOOG Nearline $5,100 $120,000 $85,320 $210,420 8.5
MSFT Cool $6,260 $150,000 $16,260 $172,520 1.3

Second, "archival" storage services, typically with long lock-in:

Archival Services
Service In Store Out Total Lock-in (months)
AMZN Glacier $2,250 $48,000 $55,240 $105,490 13.8
GOOG Coldline $3,600 $84,000 $83,860 $171,460 12.0
MSFT Archive $6,350 $24,000 $16,260 $46,610 8.1

Note that Amazon recently pre-announced an even colder and cheaper storage class than Glacier, Glacier Deep Archive:
S3 Glacier Deep Archive is a new Amazon S3 storage class that provides secure, durable object storage for long-term data retention and digital preservation. S3 Glacier Deep Archive offers the lowest price of storage in AWS, and reliably stores any amount of data. It is the ideal storage class for customers who need to make archival, durable copies of data that rarely, if ever, need to be accessed.

With S3 Glacier Deep Archive, customers can eliminate the need for on-premises tape libraries, and no longer have to manage hardware refreshes and re-write data to new tapes as technologies evolve. All data stored in S3 Glacier Deep Archive can be retrieved within 12 hours.
Third, storage-only services, with short lock-in:

Storage-only Services
Service In Store Out Total Lock-in (months)
Wasabi $2,495 $59,880 $2,495 $64,870 0.5
Backblaze B2 $2,504 $60,000 $12,504 $75,008 2.5

For details of how these numbers were computed, and the assumptions that make them low-ball estimates, see the appendix.
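The lock-in column can be reproduced from the other columns: it is simply the egress cost divided by one month's storage cost. A minimal Python sketch, using a few of the figures from the tables above:

```python
# Lock-in (months) = cost of extracting the Petabyte in a month,
# divided by one month of storage cost. "Store" is the yearly cost.
services = {
    "AMZN S3":      (258_600, 63_769),   # (store $/yr, egress $)
    "AMZN Glacier": (48_000, 55_240),
    "Wasabi":       (59_880, 2_495),
}

for name, (store_per_year, egress) in services.items():
    lock_in_months = egress / (store_per_year / 12)
    print(f"{name}: lock-in ~ {lock_in_months:.1f} months")
```

High egress charges relative to storage cost are what make the "archival" services' lock-in periods so long.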

Any memory institution's digital collection would represent, in size and traffic and thus revenue, a rather small customer of any cloud service they would in practice select. The institution would have no negotiating leverage to counteract the service's lock-in. Thus a transition to cloud services needs to invest in minimizing the cost of switching vendors if the institution is not to be vulnerable to rent extraction.

Operational Lock-In

In addition to the lock-in costs designed into most cloud providers' pricing, there are additional lock-in costs arising from the process of switching providers:
  • Unless the institution retains an in-house copy (see "Reliability" below), there is unlikely to be on-premise storage capable of holding the data extracted from the old service using its tooling before it is uploaded to the new one using its tooling. A direct pipeline between the two services would be necessary, and the new service may well not have the necessary tooling in place.
  • Some changes to the applications that access the stored data, and to the processes by which sysadmins manage the cloud deployment, might be necessary (see APIs below).
  • If, as is likely, providers had to be switched at short notice, additional costs would be incurred, for example from the need for temporary staff, and from the upper hand the new provider would have in negotiations. Google, at least, has a history of announcing the planned termination date for a service, and subsequently moving the date up significantly, disrupting customers' transition plans (see "Reliability" below).
  • If, on the other hand, the timescale for a provider switch is extended, the institution will be paying both providers for an extended period.

Demand Spikes

Dating back to at least 2000's Blue Ribbon Task Force on Sustainable Digital Preservation and Access, it has been clear that the primary justification for preserving digital content is to have it be accessed, the more the better. At all institutional scales smaller than the Internet Archive, cloud is clearly superior to on-premise technology in this area.

Traditionally, access to archived content has been on an item-by-item basis. The somewhat sparse research on the historical patterns of access to archived data suggests that the probability per unit time of any individual item being accessed is very low, and much is for integrity checking. A small number of items would be extracted and exported from the archive's infrastructure. Analysis would be undertaken externally. Demand for on-premise bandwidth and computation was minimal; the fact that it was intermittent didn't involve significant infrastructure investments standing idle much of the time.

The advent of "data mining" is a revolution in access for two reasons:
  • Each scholar accessing the collection will access a substantial proportion of it in a short time, needing much higher bandwidth from storage than on-premise storage frequently provides, especially if it is implemented using tape.
  • Because each scholar's access requires high-bandwidth access to the large amounts of data, external analysis isn't cost-effective. Substantial compute resources must be co-located with the collection's storage. Scholars' accesses are intermittent. Either these resources will be idle much of the time, or scholars will be frustrated by the need to queue for access.
The British Library's BUDDAH program provides an excellent illustration of the potential of "big data" access to digital collections. Others can be found, for example, in Jennifer Oulette's Machine learning can offer new tools, fresh insights for the humanities.

The ability of cloud platforms to deliver "computation on demand" with "pay only for what you use" is an important enabler of "digital humanities" research. And it resolves the difficult issue of memory institutions charging for access to their collections. Researchers can be given a (free) key they can use for access to the collection; they can pay the cloud platform directly for the compute and bandwidth resources they need to analyze the data from research funds. The memory institution would not be involved in payments at all. This is the model behind Amazon's Public Datasets program, which stores access copies of important, freely-available datasets for free, making money from the computation and bandwidth charges it levies on researchers accessing them.

[Image: Rhizome daily emulations]
If the institution is involved in paying cloud access charges they need to mitigate a serious risk to the budget. Rhizome's collection of the Theresa Duncan CD-ROMs illustrates the problem. They made this important collection of late 1990s feminist games freely available, via emulations hosted at Amazon. They received significant press coverage and a consequent huge spike in usage, which satisfied Rhizome's mission but quickly threatened their budget. The team had rapidly to implement a queuing system to throttle demand to what they could afford to service. Eager visitors faced a frustrating user experience, and some were lost. This replaced the natural throttling imposed by limited on-premise resources, but in a less understandable way.

The important lesson here is the institution's need for budget predictability. On-premise costs are fairly predictable. Simple cloud implementations of access, such as those resulting from minimal porting of on-premise implementations, have unpredictable costs. Planning for a transition to cloud technology needs to anticipate significant implementation work devoted to avoiding unexpected cost spikes.

Reliability

All major cloud storage services claim 11 nines of "durability" or more. Unfortunately, as discussed at length in What Does Data "Durability" Mean?:
  • These claims represent design goals, not reliability delivered to customers.
  • The claims are based on models that include only about one-third of the causes of data loss in practice. 
  • Given the scale and long time horizons of national digital collections, even if the service delivered 11 nines of "durability" it would likely not be enough to prevent data loss. For details, see my series of posts on A Petabyte For A Century.
If claims of 11 nines of "durability" are just marketing hype, what can be said about achieving reliability adequate to preserve a national digital collection?

In What Does 11 Nines of Durability Really Mean? David Friend of Wasabi writes:
No amount of nines can prevent data loss.

There is one very important and inconvenient truth about reliability: Two-thirds of all data loss has nothing to do with hardware failure.

The real culprits are a combination of human error, viruses, bugs in application software, and malicious employees or intruders. Almost everyone has accidentally erased or overwritten a file. Even if your cloud storage had one million nines of durability, it can't protect you from human error.
Friend may be right that these are the top 5 causes of data loss, but over the timescale we are concerned with they are far from the only ones. In Requirements for Digital Preservation Systems: A Bottom-Up Approach we listed 13 of them.

It is reasonable to assume that the major services providing triple geo-redundancy, such as Amazon, Google and Microsoft, deliver more reliability than smaller services such as Wasabi and Backblaze. But they are a lot more expensive, and thus the data stored in them is more vulnerable to interruptions of the money supply. In Backblaze Durability is 99.999999999% - And Why It Doesn't Matter Brian Wilson wrote:
Any vendor selling cloud storage relies on billing its customers. If a customer stops paying, after some grace period, the vendor will delete the data to free up space for a paying customer.

Some customers pay by credit card. We don't have the math behind it, but we believe there's a greater than 1 in a million chance that the following events could occur:
  • You change your credit card provider. The credit card on file is invalid when the vendor tries to bill it.
  • Your email service provider thinks billing emails are SPAM. You don't see the emails coming from your vendor saying there is a problem.
  • You do not answer phone calls from numbers you do not recognize; Customer Support is trying to call you from a blocked number; they are trying to leave voicemails but the mailbox is full.
If all those things are true, it's possible that your data gets deleted simply because the system is operating as designed.
The risk of entire accounts being deleted due to billing problems is not hypothetical, as illustrated by this account of a near-disaster to a mission-critical application in Google's cloud:
"We will delete your project unless the billing owner corrects the violation by filling out the Account Verification Form within three business days. This form verifies your identity and ownership of the payment instrument. Failure to provide the requested documents may result in permanent account closure."
See also Lauren Weinstein's discussion of the reasons for the demise of G+:
Google knows that as time goes on their traditional advertising revenue model will become decreasingly effective. This is obviously one reason why they've been pivoting toward paid service models aimed at businesses and other organizations. That doesn't just include G Suite, but great products like their AI offerings, Google Cloud, and more.

But no matter how technically advanced those products, there's a fundamental question that any potential paying user of them must ask themselves. Can I depend on these services still being available a year from now? Or in five years? How do I know that Google won't treat business users the same ways as they've treated their consumer users?

In fact, sadly, I hear this all the time now. Users tell me that they had been planning to move their business services to Google, but after what they've seen happening on the consumer side they just don't trust Google to be a reliable partner going forward.
And the process by which G+ is being decommissioned:
We already know about Google's incredible user trust failure in announcing dates for this process. First it was August. Then suddenly it was April. The G+ APIs (which vast numbers of web sites - including mine - made the mistake of deeply embedding into their sites), we're told, will start "intermittently failing" (whatever that actually means) later this month.

It gets much worse though. While Google has tools for users to download their own G+ postings for preservation, they have as far as I know provided nothing to help loyal G+ users maintain their social contacts - the array of other G+ followers and users with whom many of us have built up friendships on G+ over the years.
Threats such as these, and the likelihood of monoculture vulnerabilities in each service's software, are a strong argument against putting all your eggs in one basket. Two strategies seem appropriate:
  • Using multiple, individually less reliable but cheaper, cloud services in the expectation that billing problems, economic failures, external & internal attacks or catastrophic software failures are unlikely to affect multiple services simultaneously. But note the cost and time involved in recovering from a failure at one service by copying the entire collection out of another service into the failed service's replacement.
  • Maintaining an archival copy in-house, with the copy to be accessed by users in a single cloud service. The in-house copy is used only to recover from cloud service failure; it can be air-gapped for better security, and can provide better hardware and media diversity by using optical media to mitigate the risk of electromagnetic pulse (cf. Facebook's cold storage), or less usefully, tape (as recommended by Spectralogic, not an unbiased source). Panasonic has the optical media jukebox technology in production, IBM is working on similar technology.
Since the cheaper cloud storage services do not provide the compute platform needed to enable effective access, including data mining, the second strategy seems preferable. Access can be provided from the cloud copy using automatic scaling services such as Amazon's Elastic Beanstalk. If it becomes necessary to change cloud providers, the new one can be provisioned from the in-house copy without incurring egress charges, thus mitigating lock-in. In the unlikely event that the in-house copy is lost or damaged, it can be repaired from the cloud copy at the cost of incurring egress charges.

Integrity Verification

Verifying the integrity of data stored in a cloud service without trusting the service to some extent is a difficult problem to which no wholly satisfactory solution has been published. Asking the service for the hash of the data requires trusting the service to re-compute the hash at the time of the request rather than simply remembering the hash it initially computed. Hashing is expensive; the service has an incentive to avoid doing it.

The fundamental difficulty is that "the proof of the pudding is in the eating" - implementing a "Proof of Retrievability" normally requires actually extracting the data and hashing it in a trusted environment. To be trusted the environment must be outside the service, and thus egress charges will be incurred.

Shah et al's Auditing to Keep Online Storage Services Honest shows how the impact of egress charges can be reduced by a factor of M. Before storing the data D they compute an array of M+1 challenges C[i] = hash(N[i],D) where N[i] is an array of M+1 random nonces. The i-th integrity check is performed by sending the service N[i] and requesting hash(N[i],D) to compare with the C[i] the customer remembered. There is no need to trust the service because, before the request, the service doesn't know N[i], so it has to hash the whole of the data at the time of the request. Once every M integrity verifications, the entire data must be extracted, incurring egress charges, so that the data can be checked and a new set of M+1 challenges computed in a trusted environment.
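The nonce-challenge idea can be sketched as follows. This is a simplified illustration of the technique, not Shah et al's full protocol (which also handles encrypted data and third-party auditors); the function names are mine:

```python
import hashlib
import secrets

M = 10  # number of remote checks between full extractions

def make_challenges(data: bytes, m: int):
    """Before upload: compute M+1 (nonce, expected-hash) pairs in a
    trusted environment. Only the customer knows the nonces."""
    pairs = []
    for _ in range(m + 1):
        nonce = secrets.token_bytes(32)
        expected = hashlib.sha256(nonce + data).hexdigest()
        pairs.append((nonce, expected))
    return pairs

def service_response(nonce: bytes, stored_data: bytes) -> str:
    """What the service must compute on demand. Because the nonce is
    unknown in advance, it cannot answer from a remembered hash; it
    must hash the whole of the data at the time of the request."""
    return hashlib.sha256(nonce + stored_data).hexdigest()

data = b"the preserved content"
challenges = make_challenges(data, M)

# The i-th integrity check: reveal nonce i, compare the response.
nonce, expected = challenges[0]
assert service_response(nonce, data) == expected          # data intact
assert service_response(nonce, b"corrupted") != expected  # loss detected
```

Each nonce can be used only once (revealing it lets the service remember the answer), which is why the challenge supply must be replenished, and the data re-verified, in a trusted environment every M checks.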

See also my review of RFC4810 on its tenth anniversary.

Security

The security environment of digital collections stored in the cloud is significantly different from that of collections stored on-premise. Cloud storage is necessarily exposed to the public internet, where on-premise storage can be air-gapped, or carefully firewalled, from it. The security of cloud storage depends on careful configuration, and empirical studies show that this is beyond the capacity of most cloud customers:
The average business has around 14 improperly configured IaaS instances running at any given time and roughly one in every 20 AWS S3 buckets are left wide open to the public internet.

These are among the grim figures rolled out Monday by researchers with McAfee, who say that security practice has not kept up with the rapid adoption of cloud services.
For examples see here, here, here, here, here, and here. The risk is generally of inappropriate read rather than write access, but this may be a major concern for collections, such as copyright deposit, encumbered with restrictions on public access, or embargoed.

It is often suggested that the security of preserved content exposed to the Internet can be enhanced by encryption at rest. The downside of doing so is that it introduces catastrophic failure modes. The single key that encrypts all the content could be lost or compromised, the key protecting the key management system that stores the individual keys for each content item could be lost or compromised, or the cryptosystem itself could be broken. This isn't to rule out the use of encryption. Certainly self-encrypting media should be used to prevent inappropriate read access by dumpster-divers. And the risk of key loss or compromise might be assessed as less than the protection they provide.

API

There are three types of API involved in cloud deployments.
  • There are the various operating system APIs that applications use. Cloud platforms generally allow customers to run their choice of operating system in the cloud's virtual machines, so it is likely that legacy on-premise applications can use the same operating system in the cloud.
  • There are two widely used but quite different APIs applications use to access stored data: the traditional POSIX file access API, and the object access API defined by Amazon's S3 that is ubiquitous among cloud storage services. Legacy software almost certainly uses the POSIX API. Even naive ports from POSIX to S3 can be difficult, and they often have miserable performance. Doing better can involve significant restructuring of code that may not be open source, and thus not available for modification.
  • The APIs and user interfaces that sysadmins use to create and manage a cloud deployment can be significantly different from those in use on-premise. There will normally be a learning curve for sysadmins in a cloud transition, or when switching cloud providers.
Cloud platforms typically offer storage services that export the POSIX file API. For example, Amazon's Elastic Block Storage (EBS) supports file systems up to a maximum of 16TB each. But unlike S3 they are not designed for long-term storage. They aren't suitable for digital preservation because of their size limit, their claimed reliability at only 5 nines, their lack of triple geo-replication, and the fact that they are about twice as expensive:
  • EBS throughput optimized volumes cost $0.045 per GB/month against S3 standard at $0.023 per GB/month or less.
  • EBS cold volumes cost $0.025 per GB/month against S3 standard infrequent access at $0.0125 per GB/month.
Some cloud vendors also provide higher-level services, such as search, analytics or machine learning. Once transitioned to a cloud provider there is a significant temptation to exploit these capabilities, but doing so almost certainly increases vendor lock-in, because vendors typically use these services to differentiate their product from their competitors'. When switching providers there are both learning curve and porting overheads to consider.

Kryder's Law

For the three decades leading up to 2010 the price per byte of disk storage dropped 30-40%/year, following Kryder's Law, the disk analog of Moore's Law. Over time this fed into drops in the price of cloud storage services. Industry roadmaps for disks in the years before 2010 expected that the then-current technology, Perpendicular Magnetic Recording (PMR), would be rapidly displaced by Heat-Assisted Magnetic Recording (HAMR).

The transition from PMR to HAMR turned out to be far harder and more expensive than the industry expected, leading to an abrupt halt to disk price drops in 2010. In 2011 40% of the industry's capacity was destroyed in the Thai floods, and prices increased. Small quantities of HAMR drives are only now entering the market. Since the recovery from the Thai floods the Kryder rate has been closer to 10% than to the industry's perennially optimistic forecasts. For details, see The Medium-Term Prospects for Long-Term Storage Systems.

Cloud storage is, except for specialized archival services using tape and premium high-performance services using SSDs, implemented using "nearline disk". There is no real prospect of a return to 30-40% Kryder rates for disk, nor that SSDs will displace disk for bulk data. Thus planning for long-term storage, whether in the cloud or on-premise, should assume Kryder rates of 10-20%/year (to be conservative, at the low end of the range) and the corresponding rates for cloud storage.
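The sensitivity of long-term costs to the Kryder rate can be illustrated with a toy net-present-value calculation. The $5/TB/month starting price, 4% discount rate and 30-year horizon are purely illustrative assumptions, and the sketch ignores media replacement cycles and operational cost structure:

```python
# Toy endowment calculation: net present value of storing 1TB for
# 30 years, with the price declining at the Kryder rate and future
# costs discounted at a fixed interest rate.
def endowment(monthly_cost, years=30, kryder=0.10, discount=0.04):
    total = 0.0
    for y in range(years):
        yearly = 12 * monthly_cost * (1 - kryder) ** y  # price decline
        total += yearly / (1 + discount) ** y           # discounting
    return total

for rate in (0.40, 0.20, 0.10, 0.0):
    print(f"Kryder rate {rate:.0%}: endowment ${endowment(5.0, kryder=rate):,.0f}")
```

Even in this crude form, the total cost of long-term storage rises steeply as the assumed Kryder rate falls, which is why planning on the industry's optimistic forecasts is risky.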

[Figure: Byron et al, Fig 2]
The effect of varying Kryder rates on the net present value of the costs of storing data for the long term can be explored using my simple economic model, which was based on research with UC Santa Cruz' Center for Research in Storage Systems six years ago. James Byron, Darrell Long, and Ethan Miller's Using Simulation to Design Scalable and Cost-Efficient Archival Storage Systems (also here) last year reported on a vastly more sophisticated model developed at the Center. It includes both much more detailed data about, for example, electricity cost, and covers various media types including tape, optical, and SSDs. They conclude:
Developments in storage technology affect the long-term total cost of ownership for archival systems. We have designed a simulator for archival storage that compares the relative cost of different technologies in an archival system. We found that the growth rates of performance and capacity for different storage technologies predict the cost of using them in archival systems. Hard disks, which require more electricity than other storage devices, offer a competitive solution to tape archival systems, particularly if the archived data must be accessed frequently. Solid state drives, which are more expensive for archival storage than tape or hard disk in terms of capital cost, require less power while offering more throughput than other storage technologies. We observed that the slow pace of development for optical disc technology will cause disc-based archives to become more expensive than other technologies; however, optical disc will remain a viable archival technology if its capacity and throughput increase more rapidly than they have in the past. We observed that the long-term prospect for development varies for different types of technology. Hard disks will likely remain competitive with tape for archival storage systems for years to come notwithstanding the prospect that hard disk capacity will increase more slowly than it has in the past.
This is truly impressive work that deserves detailed study.

On-premise vs. Cloud Economics

Clearly, the major economic driver of cloud economics is economies of scale. The four major cloud vendors build data centers, buy hardware and communications in huge volumes, and can develop and deploy high levels of automation to maximize the productivity of their staff. Nevertheless, the advantage a major vendor has over a smaller competitor decreases as the size differential between them decreases. On the other hand, smaller, storage-only services have an advantage in their singular focus on storage, whereas storage is only one of many products for the major vendors.

Storage-only services such as Wasabi and Backblaze appear to cost around $5/TB/month against around $20/TB/month for the major vendors, but 4x isn't an apples-to-apples comparison. The storage-only services don't provide triple geo-redundancy. A more realistic comparison is S3's 1-zone Infrequent Access product at $10/TB/month, or 2x, and it is a better product since it is integrated with a compute platform.

Amazon's margins on AWS overall are around 25%, and on storage probably higher. Backblaze and Wasabi almost certainly run on much lower margins, say 5%. So the cost comparison is probably around 1.5x. Backblaze stores around 750PB of data in rented data center space, so economies of scale are not a big factor at the Exabyte scale.
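The margin-adjusted comparison can be made explicit. The list prices and margin percentages below are the estimates from the preceding paragraphs, not published figures:

```python
# Approximate underlying cost = list price * (1 - estimated margin),
# in $/TB/month. Both margin figures are rough estimates.
s3_one_zone_ia = 10.0 * (1 - 0.25)  # major vendor, ~25% margin estimate
storage_only = 5.0 * (1 - 0.05)     # Wasabi/Backblaze, ~5% margin estimate

ratio = s3_one_zone_ia / storage_only
print(f"Estimated underlying cost ratio: {ratio:.2f}x")
```

So on these estimates the majors' underlying costs are only about half again those of the storage-only services, despite the apparent 4x gap in list prices.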

The Internet Archive's two copies of over 30PB of data are examples of extremely economical on-premise storage. They use custom-built servers and software, housed outside commercial data center space, and provide a significantly lower level of reliability. Nevertheless, they provide relatively intensive access to both copies, sustaining about 40Gb/s outbound traffic 24/7 as one of the top 300 Web sites in the world.

It is very unlikely that more conservative institutions could approach the Internet Archive's cost per TB/month. They spend about $2.4M/yr on hardware, so storage and compute combined cost them around $3.33/TB/month. That is around 30% less than Backblaze's storage-only service, which also uses custom hardware and software, but in rented data center space, which likely accounts for much of the cost differential.

My guess is that more conservative institutions would need to operate at the 100PB scale in on-premise data center space before they could compete on raw cost with the storage-only services for preservation on nearline disk. The advantages of on-demand scaling are so large that institutions lacking the Internet Archive's audience cannot compete with the major cloud platforms for access, even ignoring the demands of "big data" access.

However, if we assume that access is provided from a major cloud platform, it isn't necessary for on-premise preservation storage to be on nearline disk. Facebook used two different technologies to implement their cold storage, which addresses similar requirements to preservation storage. I discussed the details in More on Facebook's Cold Storage; in summary the two technologies are mostly-powered-down nearline disk, and Blu-Ray optical disk robots. At Facebook's scale the cost savings of both are very significant, not because of media costs but because of synergistic effects. Both can be housed in cheap warehouse space without air-conditioning, redundant power, or raised floors. The Blu-Ray system is the more interesting, and Panasonic has a version in production. Depending upon how many of the synergies can be harvested at institutional scale, it is quite possible that Panasonic's optical media technology could be a significantly cheaper on-premise preservation storage solution than cloud providers' services.

Jurisdiction and Liability

Government shutdown causing information access problems by James A. Jacobs and James R. Jacobs documents the effect of the Trump government shutdown on access to globally important information:
Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see for example, Wall Street Journal (paywall), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop to name but a few).
Matthew Green, a professor at Johns Hopkins, said "It's worrying that every single US cryptography standard is now unavailable to practitioners." He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.
They point out how this illustrates the importance of libraries collecting and preserving web-published information:
Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut-down.
But the episode also points out the power that even the few governments with a constitutional mandate for freedom of the press have over online content. Both the EU, with the "right to be forgotten" and the GDPR, and the US in the long-running dispute between the Dept. of Justice and Microsoft over e-mails on a server in Ireland, claim extra-territorial jurisdiction in cyberspace. As I wrote back in 2015:
the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims.
The Microsoft case was eventually declared moot by the Supreme Court after passage of the CLOUD act, which:
asserts that U.S. data and communication companies must provide stored data for U.S. citizens on any server they own and operate when requested by warrant, but provides mechanisms for the companies or the courts to reject or challenge these if they believe the request violates the privacy rights of the foreign country the data is stored in.
Although the CLOUD act has some limitations, the precedent that the Dept. of Justice was seeking was in effect established.

Outsourcing to the cloud in a world of extra-territorial jurisdiction involves two kinds of risks:
  • The physical location of the servers on which the data is stored, as in the Microsoft case, becomes an issue. In some cases there are legal requirements in the cloud customer's jurisdiction that data be kept only on servers within that jurisdiction. Major cloud services such as Amazon allow customers to specify such restrictions. Presumably the goal of the legal requirements is to establish that the customer's jurisdiction controls access to the data, but in an extra-territorial world they don't guarantee this.
  • The venue for dispute resolution is also an issue. The cloud service's End User License Agreement is written by their lawyers to transfer all possible risk to the customer. Part of that typically involves specifying the jurisdiction under which disputes will be resolved (all major cloud vendors are US-based), and the process for resolving them. For example, the AWS Customer Agreement specifies US federal law governs, and enforces mandatory binding arbitration in the US (Section 13.5). This all means, in effect, that customers have no realistic recourse against bad behavior by their cloud vendor.
Of course, in practice no dispute with the major cloud vendors is possible because they uniformly disclaim all warranties and any liability for actually delivering the service for which the customer pays (see, for example, the AWS Customer Agreement Sections 10 and 11).

The most likely cause for a dispute with a cloud service vendor is when the vendor decides that the service upon which the customer depends is insufficiently profitable, and abandons it. Google in particular has a history of doing so, tracked by Le Monde's Google Memorial, le petit musée des projets Google abandonnés, the Google Graveyard, and the comments to this blog post.

Source
This is unlikely to be a problem with Amazon, for two reasons. First, they have become "too big to fail", since much of the US government's IT infrastructure depends on AWS, as does much of the UK government's. Second, AWS controls enough of the market to effectively set prices. In consequence it is extremely profitable:
The margins on AWS, averaging 24.75% over the last twelve quarters, are what enables Amazon to run the US retail business averaging under 3% margin and the international business averaging -3.7% margin over the same period.
It should be noted that part of the reason for AWS' profitability is that, like the other major cloud vendors, Amazon's tax-avoidance practices are extremely effective. Taxpayer funded institutions might run some risk to their reputation by becoming dependent upon a company that blatantly abuses the public purse in this way.

Arkivum

There may be at least one cloud service designed specifically for infrequently accessed archival data that does accept some liability for delivering what its customers pay for: a British company called Arkivum. I say "may be" because since I last looked at them their web site has been redesigned so as to break all deep links to it (a bad sign from an archiving company) and I cannot find any useful information on the new site. But I have third-party sources and the Wayback Machine, so here goes.

In 2016, with the traditional jokey headline, Chris Mellor at The Register reported on Arkivum:
It's competing successfully with public cloud and large-scale on-premises tape libraries by offering escrow-based guaranteed storage in its cloud.

It was started up in 2011 as a way of productising a service technology developed at the University of Southampton. This features data kept on tape in three remote locations, one of them under an Escrow arrangement, a service licensing agreement (SLA), a 25-year data integrity guarantee, and an on-premises appliance to take in large amounts of data and provide a local cache. ... The Escrow system is there to demonstrate no lock-in to Arkivum and the ability to get your data back should anything untoward happen.
Their 2016 General Terms and Conditions provide SLAs for availability and access latency. There don't appear to be egress charges, but egress is limited:
The Storage Services are intended for storing archive data, which is data that is not frequently accessed or updated. Frequent access to data stored via the Storage Services will be subject to Arkivum's fair usage policies, which may vary from time to time. The current policy permits retrieval of up to 5% of the Customer's data either by number of files or data volume per month.
The design of Arkivum's product is of interest because it specifically addresses a number of concerns with the use of cloud storage services for archival data:
  • The escrow feature mitigates lock-in to some extent. Presumably the idea is that once the tapes have been retrieved from escrow, the data can be recovered from them. But as I understand it the data is encrypted and formatted by Arkivum's software, so this may be a non-trivial task.
  • The data integrity guarantee accepts some level of liability for delivering the service customers pay for.
  • The on-premise appliance provides a file-based rather than an object-based API for storing data to the service, and accessing data once it has been retrieved from the system. Access takes a request to the service, and a delay, and is subject to the "fair usage policies".
However, there are still concerns that it does not address, such as integrity verification. Presumably, the data integrity guarantee only applies after an attempt to retrieve data fails. At most 5% of the data can be retrieved per month, so a complete integrity verification pass would take at least 20 months; in practice each item could be verified less than once every year and a half. Data integrity without retrieving the data has to be taken on trust.
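A quick sketch of the arithmetic behind that verification cadence, assuming every month's retrieval allowance under the fair-usage policy is spent on integrity checking:

```python
# Sketch: how often a full integrity pass is possible under a retrieval
# cap like Arkivum's fair-usage policy (5% of stored data per month).

def months_for_full_verification(monthly_cap_fraction: float) -> float:
    """Months needed to retrieve and checksum every byte once, if each
    month's entire retrieval allowance is spent on verification."""
    return 1.0 / monthly_cap_fraction

# With a 5%/month cap, one complete pass takes 20 months, and that is
# only if the allowance is never used for actual access requests.
print(months_for_full_verification(0.05))
```

Any genuine access by users competes with verification for the same allowance, stretching the cycle further.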

Appendix

The data for the lock-in tables comes from the services' published prices, as linked from the first column. Assumptions:
  • Selected products all claim at least 11 nines of "durability". But note that these claims are dubious, and:
    only Backblaze reveals how they arrive at their number of nines, and that their methodology considers only hardware failures, and in fact only whole-drive failures.
  • Geo-replication, except for storage-only services. This partly explains the cost differential.
  • US East region (cheapest).
  • Object (blob) storage, not file storage.
  • Data transfer consists of 100MB objects.
  • No data transfer failures needing re-transmission.
  • No access even for integrity checks.
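To put the durability claims in perspective, here is a back-of-the-envelope sketch; the collection size is an illustrative assumption, not a figure from the tables:

```python
# Sketch: what a claim of "eleven nines of durability", taken at face
# value, implies: an expected annual loss of 1e-11 of stored objects.

def expected_annual_losses(num_objects: float, nines: int) -> float:
    """Expected number of objects lost per year at the claimed
    durability (probability of loss = 10^-nines per object per year)."""
    return num_objects * 10.0 ** (-nines)

# A petabyte stored as 100MB objects is 10 million objects; at 11 nines
# the expected wait to lose a single one is on the order of 10,000
# years. Claims this strong are exactly why the methodology matters.
print(expected_annual_losses(1e7, 11))
```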
Update 11 Feb 19: the pricing data I used originally was slightly out-of-date and there were slight mistakes in the spreadsheet. These have been corrected.
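The lock-in arithmetic behind the tables can be sketched as follows; the prices and collection size below are illustrative placeholders, not current quotes from any vendor:

```python
# Sketch: the one-time cost of moving an archive out of a cloud storage
# service, under the table's assumptions (100MB objects, no re-transmission).
# The per-GB and per-request prices here are illustrative, not quotes.

def egress_cost(total_tb: float,
                per_gb_egress: float,   # $ per GB transferred out
                per_request: float,     # $ per GET request
                object_mb: float = 100.0) -> float:
    """One-time cost to retrieve every object once."""
    gigabytes = total_tb * 1000.0
    requests = gigabytes * 1000.0 / object_mb
    return gigabytes * per_gb_egress + requests * per_request

# e.g. a 500TB collection at an assumed $0.09/GB egress and $0.0000004
# per GET: the per-GB charge dominates; request charges are negligible.
print(round(egress_cost(500, 0.09, 0.0000004), 2))
```

The dominance of the per-GB term is what makes egress charges, rather than request charges, the lock-in mechanism.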

Since it is clear that data egress charges can have a significant impact on not-for-profit customers, Amazon has a program called AWS Global Data Egress Waiver:
AWS customers are eligible for waiver of egress charges under this program if they:
  • Work in academic or research institutions.
  • Run any research workloads or academic workloads. However, a few data-egress-as-a-service type applications are not allowed under this program, such as massively online open courseware (MOOC), media streaming services, and commercial, non-academic web hosting (web hosting that is part of the normal workings of a university is allowed, like departmental websites or enterprise workloads).
  • Route at least 80% of their Data Egress out of the AWS Cloud through an approved National Research and Education (NREN) network, such as Internet2, ESnet, GEANT, Janet, SingAREN, SINET, AARNet, and CANARIE. Most research institutions use these government-funded, dedicated networks to connect to AWS, while realizing higher network performance, better bandwidth, and stability.
  • Use institutional e-mail addresses for AWS accounts.
  • Work in an approved AWS Region.
Note that:
The maximum discount is 15% of total monthly spending on AWS services, which is several times the usage we typically see among our research customers.
which means that while this program may have some effect on the budget impact of demand spikes, it has little if any effect on lock-in. University research libraries would presumably qualify for this program; it isn't clear that national libraries would. The Global Data Egress Waiver is not included in the numbers in the tables.
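A rough sketch of what the 15% cap buys; the monthly spend and egress price are illustrative assumptions, not figures from the program:

```python
# Sketch: the maximum egress the AWS Global Data Egress Waiver can cover,
# given its cap of 15% of total monthly AWS spend. The spend and per-GB
# price below are illustrative assumptions.

def max_waived_egress_gb(monthly_spend_usd: float,
                         per_gb_egress_usd: float,
                         cap_fraction: float = 0.15) -> float:
    """GB/month of egress whose charges fit under the waiver cap."""
    return monthly_spend_usd * cap_fraction / per_gb_egress_usd

# An institution spending an assumed $10,000/month at $0.09/GB could have
# at most ~16.7TB/month of egress waived: useful for demand spikes, but
# far short of a wholesale exit from a petabyte-scale archive.
print(round(max_waived_egress_gb(10_000, 0.09)))
```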

19 comments:

David. said...

The latest Digital Storage Technology Newsletter projects that the average Kryder rate for hard disk over the decade from 2012 to 2022 will be around 20%/yr. Industry projections have a history of optimism.

David. said...

Larry Dignan's survey Top cloud providers 2019: AWS, Microsoft Azure, Google Cloud; IBM makes hybrid move; Salesforce dominates SaaS makes interesting points about how difficult it is, based on their financial reports, to compare the big cloud providers:

"The top cloud providers for 2019 have maintained their positions, but the themes, strategies, and approaches to the market are all in flux. The infrastructure-as-a-service wars have been largely decided, with the spoils going to Amazon Web Services, Microsoft Azure, and Google Cloud Platform, but new technologies such as artificial intelligence and machine learning have opened the field up to other players.

Meanwhile, the cloud computing market in 2019 will have a decidedly multi-cloud spin, as the hybrid shift by players such as IBM, which is acquiring Red Hat, could change the landscape."

David. said...

The pricing data I used originally for the tables was slightly out-of-date, and there were slight mistakes in the spreadsheet. I have corrected the tables.

Unknown said...

The use of a single cloud service vendor does incur significant vendor lock-in.
It would be an interesting exercise to examine the simultaneous use of multiple cloud vendors as part of a long-term preservation strategy. This would permit retrieval based on several criteria including total cost, egress speed, and compute locality.
A robust ingest workflow tool and independent inventory registry is critical to the viability of such a solution.
In regards to vendor capriciousness, simultaneous storage in multiple vendor platforms permits selective abandonment without loss of content.

- David Pcolar

David. said...

The spreadsheet I used to generate the tables is available to view.

David. said...

To the point about reputational risk, the Institute on Taxation and Economic Policy's report Amazon in Its Prime: Doubles Profits, Pays $0 in Federal Income Taxes reveals that:

"Amazon ... nearly doubled its profits to $11.2 billion in 2018 from $5.6 billion the previous year and, once again, didn’t pay a single cent of federal income taxes.

The company’s newest corporate filing reveals that, far from paying the statutory 21 percent income tax rate on its U.S. income in 2018, Amazon reported a federal income tax rebate of $129 million. For those who don’t have a pocket calculator handy, that works out to a tax rate of negative 1 percent."

Their report from last year revealed:

"The online retail giant has built its business model on tax avoidance, and its latest financial filing makes it clear that Amazon continues to be insulated from the nation’s tax system. In 2017, Amazon reported $5.6 billion of U.S. profits and didn’t pay a dime of federal income taxes on it. The company’s financial statement suggests that various tax credits and tax breaks for executive stock options are responsible for zeroing out the company’s tax this year.

The company’s zero percent rate in 2017 reflects a longer term trend. During the previous five years, Amazon reported U.S. profits of $8.2 billion and paid an effective federal income tax rate of just 11.4 percent. This means the company was able to shelter more than two-thirds of its profits from tax during that five year period."

So over the last seven years the US federal taxpayer has, in effect, subsidized Amazon to the tune of $4.3 billion. Looked at another way, 17.4% of Amazon's profits are extracted from the federal taxpayer. Not to mention the vast amounts Amazon extracts from state and local taxpayers, which led to the revolt by New Yorkers against the $3 billion subsidy for one of their additional "headquarters".

David. said...

More on the reputational risk Amazon poses from Zephyr Teachout.

David. said...

It isn't just Amazon gouging the taxpayer. Google reaped millions in tax breaks as it secretly expanded its real estate footprint across the U.S. by Elizabeth Dwoskin reports that:

"Last May, officials in Midlothian, Tex., a city near Dallas, approved more than $10 million in tax breaks for a huge, mysterious new development across from a shuttered Toys R Us warehouse.

That day was the first time officials had spoken publicly about an enigmatic developer’s plans to build a sprawling data center. The developer, which incorporated with the state four months earlier, went by the name Sharka LLC. City officials declined at the time to say who was behind Sharka.

The mystery company was Google — a fact the city revealed two months later, after the project was formally approved. Larry Barnett, president of Midlothian Economic Development, one of the agencies that negotiated the data center deal, said he knew at the time the tech giant was the one seeking a decade of tax giveaways for the project, but he was prohibited from disclosing it because the company had demanded secrecy."

David. said...

Paul Buchheit's How a failing capitalist system is allowing Amazon to cripple America and Barry Ritholtz' HQ2: Understanding What Happened & Why both provide useful background on the way the shifting political winds are increasing the reputational risk of getting locked in to Amazon.

David. said...

According to Bloomberg, the EU is just starting to freak out about the implications of the extra-territorial CLOUD act:

"As the US pushes ahead with the "Cloud Act" it enacted about a year ago, Europe is scrambling to curb its reach. Under the act, all US cloud service providers, from Microsoft and IBM to Amazon - when ordered - have to provide American authorities with data stored on their servers, regardless of where it's housed. With those providers controlling much of the cloud market in Europe, the act could potentially give the US the right to access information on large swaths of the region's people and companies.

The US says the act is aimed at aiding investigations. But some people are drawing parallels between the legislation and the National Intelligence Law that China put in place in 2017 requiring all its organisations and citizens to assist authorities with access to information. The Chinese law, which the US says is a tool for espionage, is cited by President Donald Trump's administration as a reason to avoid doing business with companies like Huawei Technologies."

David. said...

Jamie Powell's Amazon won't spin-off Amazon Web Services is a fascinating discussion of the symbiosis between the retail side of Amazon, with low margins but huge cash flow, and the high-margin but cash-flow-negative AWS side:

"While we don't know the capital expenditure mix between retail infrastructure and AWS, it is not outrageous to suggest that Amazon's $13.4bn capex spend in 2018 was driven by AWS. Indeed, the acceleration in capital expenditure since 2010 is remarkable. In the nine financial years from 2001 to 2009, Amazon's capex grew at an annual average of 25 per cent. In the past nine years, this growth more than doubled to 57 per cent, according to S&P Capital IQ."

David. said...

Sean Keane's MySpace reportedly loses 50 million songs uploaded over 12 years is a cautionary tale about outsourced storage:

"Andy Baio, one of the people behind Kickstarter, tweeted that it could mean millions of songs uploaded between the site's Aug. 1, 2003 launch and 2015 are gone for good.

"Myspace accidentally lost all the music uploaded from its first 12 years in a server migration, losing over 50 million songs from 14 million artists," he wrote Sunday.

"I'm deeply skeptical this was an accident. Flagrant incompetence may be bad PR, but it still sounds better than 'we can't be bothered with the effort and cost of migrating and hosting 50 million old MP3s,' " Baio noted."

David. said...

Jamie Powell's The fragility of our digital selves is an apt comment on the MySpace "server migration":

"Today we've switched permanence for convenience. There's never been an easier time to create personal content, and in tune with Jevons' paradox, we've obliged: endlessly uploading videos to YouTube, photos to the Facebook-family of businesses, and our thoughts to Google and Apple. The mechanical representations of our lives are, for all intents and purposes, now locked in the cloud.

The provider of these services carry with them the same risk other businesses do: they make errors, they fall out of favour, they go bankrupt. Despite the perception of the tech giant's immutable power, history is littered with failed companies which once held apparent “monopoly” power."

David. said...

The risks of outsourcing your infrastructure are aptly illustrated by Thomas Claburn's DigitalOcean drowned my startup! 'We lost everything, our servers, and one year of database backups' says biz boss:

"Two days ago, as Beauvais tells it, the startup's cloud provider, DigitalOcean, decided that a Python script the company uses periodically to make its data easier to process was malicious. So DigitalOcean locked the company's account, which represents the entire IT infrastructure of the biz – five droplets for its web app, workers, cache and databases.

Beauvais, in a series of Twitter posts, describes sending multiple emails and Twitter direct messages to DigitalOcean and regaining access after 12 hours of downtime.

"We had to restart our data pipeline from the start as all the droplets were shut down and the Redis memory, that kept track of our advancement got wipe," he wrote. "Only four hours later our account got locked again, probably by the same automated script."

He says he then sent four messages over the next 30 or so hours to support, only to receive an automated reply that DigitalOcean had declined to re-activate the account."

David. said...

Amazon's Summary of the Amazon EC2 Issues in the Asia Pacific (Tokyo) Region (AP-NORTHEAST-1) in English, Japanese and Korean is an excellent example of their commendable transparency about issues that arise in the operation of AWS.

David. said...

On the other hand, Amazon's transparency and performance in sustainability leave much to be desired. ClimateAction.tech has started a project to remedy this by getting AWS customers to complain:

"In 2014, Amazon announced that it would power its data centers with 100% renewable energy and made recent announcement in April, of 3 new renewable energy projects, after a 3 year gap. However, Amazon’s continued failure to put a date on their end goal is a problem as there is no guarantee whether the on-going data center expansion won’t also be driving new investments in fossil fuels.
...
The lack of transparency also sets Amazon apart from other tech giants, who have already achieved their targets of 100% clean energy. Both Google and Apple reached this goal in 2018 and have independent audits in place to give their customers and shareholders confidence in what is meant by 100% renewable."

David. said...

Dan Geer and Wade Baker's For Good Measure: Is the Cloud Less Secure than On-Prem? is essential reading. They analyze data from RiskRecon covering 18,000 organizations, 5,000,000 hosts and 32,000,000 security findings of varying severity. For example, one of their findings is "a statistically significant but very low positive correlation" between the rate of high and critical security findings in the cloud and the percentage of an organization's hosts in the cloud.

David. said...

Shaun Nichols' Why do cloud leaks keep happening? Because no one has a clue how their instances are configured starts:

"The ongoing rash of data leaks caused by misconfigured clouds is the result of companies having virtually no visibility into how their cloud instances are configured, and very little ability to audit and manage them.

This less-than-sunny news comes courtesy of the team at McAfee, which said in its latest Infrastructure as a Service (IaaS) risk report that 99 per cent of exposed instances go unnoticed by the enterprises running them."

David. said...

Bank of America's CEO says that it's saved $2 billion per year by ignoring Amazon and Microsoft and building its own cloud instead by Alex Morrell and Dan DeFrancesco provides some details of the savings:

"Moynihan reminded analysts the bank took a $350 million charge in 2017 in part to execute the changeover to its private cloud.

But the results have been dramatic. The company once had 200,000 servers and roughly 60 data centers. Now, it's pared that down to 70,000 servers, of which 8,000 are handling the bulk of the load. And they've more than halved their data centers down to 23.

"We reduced expenses by basically around 40%, or $2 billion a year, on our backbone," Moynihan said, adding that they've simultaneously seen their transaction load balloon."