Executive Summary
To summarize the discussion below, the following issues need to be considered when outsourcing long-term digital preservation to the cloud:
- The effect of egress charges on demand spikes and vendor lock-in.
- Whether the reliability the service advertises is adequate, and whether it is actually delivered.
- Whether integrity verification is possible and affordable.
- The differences in security environment between on-premise and cloud storage.
- Whether legacy applications can use the API of the cloud storage service.
- Whether Kryder's Law will continue to deliver rapid decreases in the cost of storage, and whether, if it does, those decreases will be reflected in cloud storage service pricing.
- On-premise vs. cloud storage economics.
- The terms and conditions in the service's Customer Agreement, including the disclaimers of liability and the jurisdiction for dispute resolution.
Economic Vendor Lock-In
Unlike access, preservation necessarily has a long time horizon. The Digital Preservation Network wanted its members to make a 20-year commitment, but they didn't. One of the important lessons they want the community to learn from its failure is:
While research universities and cultural heritage institutions are innately long-running, they operate on that implicitly rather than by making explicit long-term plans.
Much of the attraction of cloud technology for organizations, especially public institutions funded through a government's annual budget process, is that they transfer costs from capital to operational expenditure. It is easy to believe that this increases financial flexibility. As regards ingest and dissemination, this may be true. Ingesting some items can be delayed to the next budget cycle, or the access rate limit lowered temporarily. But as regards preservation, it isn't true. It is unlikely that parts of the institution's collection can be de-accessioned in a budget crunch, only to be re-accessioned later when funds are adequate. Even were the content still available to be re-ingested, the cost of ingest is a significant fraction of the total life-cycle cost of preserving digital content.
Despite the difficulty of reconciling preservation's long time horizon with an annual budget process, the limited life of on-premise hardware provides regular opportunities to switch to lower-cost vendors. Customer acquisition is a significant cost for most cloud platforms; they need to reduce churn to a minimum. Lock-in tends to be an important aspect of cloud service business models.
To illustrate the effect of lock-in for preservation for three types of services, the tables show the cost of ingesting a Petabyte in a month over the network, then storing it for a year with no access, and then extracting it over the network in a month. The last column is the approximate number of months of storage cost that getting the Petabyte out in a month represents. First, services with associated compute platforms, whose lock-in periods vary:
|Compute Platform Services|
Second, "archival" storage services, typically with long lock-in:
Note that Amazon recently pre-announced an even colder and cheaper storage class than Glacier, Glacier Deep Archive:
S3 Glacier Deep Archive is a new Amazon S3 storage class that provides secure, durable object storage for long-term data retention and digital preservation. S3 Glacier Deep Archive offers the lowest price of storage in AWS, and reliably stores any amount of data. It is the ideal storage class for customers who need to make archival, durable copies of data that rarely, if ever, need to be accessed.
With S3 Glacier Deep Archive, customers can eliminate the need for on-premises tape libraries, and no longer have to manage hardware refreshes and re-write data to new tapes as technologies evolve. All data stored in S3 Glacier Deep Archive can be retrieved within 12 hours.
Third, storage-only services, with short lock-in:
For details of how these numbers were computed, and the assumptions that make them low-ball estimates, see the appendix.
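The arithmetic behind the last column of the tables can be sketched as follows. This is a simplified illustration, not the computation used for the tables: it ignores per-request charges, tiered pricing, and the storage that accrues during the ingest and extraction months. The storage price is the S3 standard figure quoted later in this post; the egress price of $0.05/GB and the function names are assumptions made purely for the example.

```python
GB_PER_PB = 1_000_000  # decimal units, as cloud price lists typically use


def scenario_cost(size_gb, storage_per_gb_month, egress_per_gb,
                  ingest_per_gb=0.0, months_stored=12):
    """Cost of ingesting, storing with no access, then extracting a collection."""
    ingest = size_gb * ingest_per_gb
    storage = size_gb * storage_per_gb_month * months_stored
    egress = size_gb * egress_per_gb
    return {"ingest": ingest, "storage": storage, "egress": egress,
            "total": ingest + storage + egress}


def lock_in_months(storage_per_gb_month, egress_per_gb):
    """Months of storage cost that extracting the whole collection represents."""
    return egress_per_gb / storage_per_gb_month


# Illustrative prices only: $0.023/GB/month storage, hypothetical $0.05/GB egress.
costs = scenario_cost(GB_PER_PB, storage_per_gb_month=0.023, egress_per_gb=0.05)
months = lock_in_months(storage_per_gb_month=0.023, egress_per_gb=0.05)
print(f"storage/yr ${costs['storage']:,.0f}, egress ${costs['egress']:,.0f}, "
      f"lock-in {months:.1f} months")
```

Note that the lock-in measure depends only on the ratio of the egress price to the monthly storage price, not on the size of the collection.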
Any memory institution's digital collection would represent, in size and traffic and thus revenue, a rather small customer of any cloud service they would in practice select. The institution would have no negotiating leverage to counteract the service's lock-in. Thus a transition to cloud services needs to invest in minimizing the cost of switching vendors if the institution is not to be vulnerable to rent extraction.
Operational Lock-In
In addition to the lock-in costs designed into most cloud providers' pricing, there are additional lock-in costs arising from the process of switching providers:
- Unless the institution retains an in-house copy (see "Reliability" below), there is unlikely to be on-premise storage capable of holding the data extracted from the old service using its tooling before it is uploaded to the new one using its tooling. A direct pipeline between the two services would be necessary, and the new service may well not have the necessary tooling in place.
- Some changes to the applications that access the stored data, and to the processes by which sysadmins manage the cloud deployment, might be necessary (see APIs below).
- If, as is likely, providers had to be switched at short notice, additional costs would be incurred, for example from the need for temporary staff, and from the upper hand the new provider would have in negotiations. Google, at least, has a history of announcing the planned termination date for a service, and subsequently moving the date up significantly, disrupting customers' transition plans (see "Reliability" below).
- If, on the other hand, the timescale for a provider switch is extended, the institution will be paying both providers for an extended period.
Demand Spikes
Dating back to at least 2000's Blue Ribbon Task Force on Sustainable Digital Preservation and Access, it has been clear that the primary justification for preserving digital content is to have it be accessed, the more the better. At all institutional scales smaller than the Internet Archive, cloud is clearly superior to on-premise technology in this area.
Traditionally, access to archived content has been on an item-by-item basis. The somewhat sparse research on the historical patterns of access to archived data suggests that the probability per unit time of any individual item being accessed is very low, and that much of the access is for integrity checking. A small number of items would be extracted and exported from the archive's infrastructure. Analysis would be undertaken externally. Demand for on-premise bandwidth and computation was minimal; the fact that it was intermittent didn't involve significant infrastructure investments standing idle much of the time.
The advent of "data mining" is a revolution in access for two reasons:
- Each scholar accessing the collection will access a substantial proportion of it in a short time, needing much higher bandwidth from storage than on-premise storage frequently provides, especially if it is implemented using tape.
- Because each scholar's access requires high-bandwidth access to the large amounts of data, external analysis isn't cost-effective. Substantial compute resources must be co-located with the collection's storage. Scholars' accesses are intermittent. Either these resources will be idle much of the time, or scholars will be frustrated by the need to queue for access.
The ability of cloud platforms to deliver "computation on demand" with "pay only for what you use" is an important enabler of "digital humanities" research. And it resolves the difficult issue of memory institutions charging for access to their collections. Researchers can be given a (free) key they can use for access to the collection; they can pay the cloud platform directly for the compute and bandwidth resources they need to analyze the data from research funds. The memory institution would not be involved in payments at all. This is the model behind Amazon's Public Datasets program, which stores access copies of important, freely-available datasets for free, making money from the computation and bandwidth charges it levies on researchers accessing them.
|Rhizome daily emulations|
The important lesson here is institutions' need for budget predictability. On-premise costs are fairly predictable. Simple cloud implementations of access, such as those resulting from minimal porting of on-premise implementations, have unpredictable costs. Planning for a transition to cloud technology needs to anticipate significant implementation work devoted to avoiding unexpected cost spikes.
Reliability
All major cloud storage services claim 11 nines of "durability" or more. Unfortunately, as discussed at length in What Does Data "Durability" Mean?:
- These claims represent design goals, not reliability delivered to customers.
- The claims are based on models that include only about one-third of the causes of data loss in practice.
- Given the scale and long time horizons of national digital collections, even if the service delivered 11 nines of "durability" it would likely not be enough to prevent data loss. For details, see my series of posts on A Petabyte For A Century.
In What Does 11 Nines of Durability Really Mean? David Friend of Wasabi writes:
No amount of nines can prevent data loss.
There is one very important and inconvenient truth about reliability: Two-thirds of all data loss has nothing to do with hardware failure.
The real culprits are a combination of human error, viruses, bugs in application software, and malicious employees or intruders. Almost everyone has accidentally erased or overwritten a file. Even if your cloud storage had one million nines of durability, it can't protect you from human error.
Friend may be right that these are the top 5 causes of data loss, but over the timescale we are concerned with they are far from the only ones. In Requirements for Digital Preservation Systems: A Bottom-Up Approach we listed 13 of them.
It is reasonable to assume that the major services providing triple geo-redundancy such as Amazon, Google and Microsoft deliver more reliability than smaller services such as Wasabi and Backblaze. But they are a lot more expensive, thus more vulnerable to interruptions of the money supply. In Backblaze Durability is 99.999999999% - And Why It Doesn't Matter Brian Wilson wrote:
Any vendor selling cloud storage relies on billing its customers. If a customer stops paying, after some grace period, the vendor will delete the data to free up space for a paying customer.
Some customers pay by credit card. We don't have the math behind it, but we believe there's a greater than 1 in a million chance that the following events could occur:
- You change your credit card provider. The credit card on file is invalid when the vendor tries to bill it.
- Your email service provider thinks billing emails are SPAM. You don't see the emails coming from your vendor saying there is a problem.
- You do not answer phone calls from numbers you do not recognize; Customer Support is trying to call you from a blocked number; they are trying to leave voicemails but the mailbox is full.
If all those things are true, it's possible that your data gets deleted simply because the system is operating as designed.
The risk of entire accounts being deleted due to billing problems is not hypothetical, as illustrated by this account of a near-disaster to a mission-critical application in Google's cloud:
"We will delete your project unless the billing owner corrects the violation by filling out the Account Verification Form within three business days. This form verifies your identity and ownership of the payment instrument. Failure to provide the requested documents may result in permanent account closure."
See also Lauren Weinstein's discussion of the reasons for the demise of G+:
Google knows that as time goes on their traditional advertising revenue model will become decreasingly effective. This is obviously one reason why they've been pivoting toward paid service models aimed at businesses and other organizations. That doesn't just include G Suite, but great products like their AI offerings, Google Cloud, and more.
But no matter how technically advanced those products, there's a fundamental question that any potential paying user of them must ask themselves. Can I depend on these services still being available a year from now? Or in five years? How do I know that Google won't treat business users the same ways as they've treated their consumer users?
In fact, sadly, I hear this all the time now. Users tell me that they had been planning to move their business services to Google, but after what they've seen happening on the consumer side they just don't trust Google to be a reliable partner going forward.
And the process by which G+ is being decommissioned:
We already know about Google's incredible user trust failure in announcing dates for this process. First it was August. Then suddenly it was April. The G+ APIs (which vast numbers of web sites - including mine - made the mistake of deeply embedding into their sites, we're told will start "intermittently failing" (whatever that actually means) later this month.
It gets much worse though. While Google has tools for users to download their own G+ postings for preservation, they have as far as I know provided nothing to help loyal G+ users maintain their social contacts - the array of other G+ followers and users with whom many of us have built up friendships on G+ over the years.
Threats such as these, and the likelihood of monoculture vulnerabilities in each service's software, are a strong argument against putting all your eggs in one basket. Two strategies seem appropriate:
- Using multiple, individually less reliable but cheaper, cloud services in the expectation that billing problems, economic failures, external & internal attacks or catastrophic software failures are unlikely to affect multiple services simultaneously. But note the cost and time involved in recovering from a failure at one service by copying the entire collection out of another service into the failed service's replacement.
- Maintaining an archival copy in-house, with the copy to be accessed by users in a single cloud service. The in-house copy is used only to recover from cloud service failure; it can be air-gapped for better security, and can provide better hardware and media diversity by using optical media to mitigate the risk of electromagnetic pulse (cf. Facebook's cold storage), or less usefully, tape (as recommended by Spectralogic, not an unbiased source). Panasonic has the optical media jukebox technology in production, IBM is working on similar technology.
Integrity Verification
Verifying the integrity of data stored in a cloud service without trusting the service to some extent is a difficult problem to which no wholly satisfactory solution has been published. Asking the service for the hash of the data requires trusting the service to re-compute the hash at the time of the request rather than simply remembering the hash it initially computed. Hashing is expensive; the service has an incentive to avoid doing it.
The fundamental difficulty is that "the proof of the pudding is in the eating" - implementing a "Proof of Retrievability" normally requires actually extracting the data and hashing it in a trusted environment. To be trusted the environment must be outside the service, and thus egress charges will be incurred.
Shah et al's Auditing to Keep Online Storage Services Honest shows how the impact of egress charges can be reduced by a factor of M. Before storing the data D they compute an array of M+1 challenges C[i] = hash(N[i],D) where N[i] is an array of M+1 random nonces. The i-th integrity check is performed by sending the service N[i] and requesting hash(N[i],D) to compare with the C[i] the customer remembered. There is no need to trust the service because, before the request, the service doesn't know N[i], so it has to hash the whole of the data at the time of the request. Once every M integrity verifications, the entire data must be extracted, incurring egress charges, so that the data can be checked and a new set of M+1 challenges computed in a trusted environment.
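The challenge-response idea can be illustrated with a minimal sketch. It is not Shah et al's actual protocol or code: it assumes that hash(N[i],D) is SHA-256 over the nonce followed by the data, and `StorageService` is a hypothetical in-memory stand-in for a real cloud service that would have to offer such a "hash with nonce" request.

```python
import hashlib
import secrets


def challenge_response(nonce: bytes, data: bytes) -> bytes:
    """hash(N, D): here assumed to be SHA-256 over the nonce followed by the data."""
    return hashlib.sha256(nonce + data).digest()


def make_challenges(data: bytes, m: int):
    """Before upload, precompute M+1 (nonce, expected hash) pairs in a trusted environment."""
    nonces = [secrets.token_bytes(32) for _ in range(m + 1)]
    return [(nonce, challenge_response(nonce, data)) for nonce in nonces]


class StorageService:
    """Hypothetical stand-in for the remote service holding the data."""

    def __init__(self, data: bytes):
        self._data = data

    def respond(self, nonce: bytes) -> bytes:
        # An honest service must hash all the data now; it has never seen this nonce.
        return challenge_response(nonce, self._data)


def audit(service: StorageService, challenges) -> bool:
    """Consume one unused challenge and check the service's answer against it."""
    nonce, expected = challenges.pop()
    return secrets.compare_digest(service.respond(nonce), expected)


data = secrets.token_bytes(1024)
challenges = make_challenges(data, m=3)
print(audit(StorageService(data), challenges))                   # honest service
print(audit(StorageService(data[:-1] + b"\x00\x01"), challenges))  # altered data
```

Each nonce can be used only once, since after the check the service could simply remember the answer. That is why the supply of challenges runs out and the data must periodically be extracted to compute a fresh set.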
See also my review of RFC4810 on its tenth anniversary.
Security
The security environment of digital collections stored in the cloud is significantly different from that of collections stored on-premise. Cloud storage is necessarily exposed to the public internet, where on-premise storage can be air-gapped, or carefully firewalled, from it. The security of cloud storage depends on careful configuration, and empirical studies show that this is beyond the capacity of most cloud customers:
The average business has around 14 improperly configured IaaS instances running at any given time and roughly one in every 20 AWS S3 buckets are left wide open to the public internet.
These are among the grim figures rolled out Monday by researchers with McAfee, who say that security practice has not kept up with the rapid adoption of cloud services.
For examples see here, here, here, here, here, and here. The risk is generally of inappropriate read rather than write access, but this may be a major concern for collections, such as copyright deposit, encumbered with restrictions on public access, or embargoed.
It is often suggested that the security of preserved content exposed to the Internet can be enhanced by encryption at rest. The downside of doing so is that it introduces catastrophic failure modes. The single key that encrypts all the content could be lost or compromised, the key protecting the key management system that stores the individual keys for each content item could be lost or compromised, or the cryptosystem itself could be broken. This isn't to rule out the use of encryption. Certainly self-encrypting media should be used to prevent inappropriate read access by dumpster-divers. And the risk of key loss or compromise might be assessed as less than the protection they provide.
API
There are three types of API involved in cloud deployments.
- There are the various operating system APIs that applications use. Cloud platforms generally allow customers to run their choice of operating system in the cloud's virtual machines, so it is likely that legacy on-premise applications can use the same operating system in the cloud.
- There are two widely used but quite different APIs applications use to access stored data: the traditional POSIX file access API and the object access API defined by Amazon's S3 that is ubiquitous among cloud storage services. Legacy software almost certainly uses the POSIX API. Even naive ports from POSIX to S3 can be difficult, and they often have miserable performance. Doing better can involve significant restructuring of the code, which may not be open source and thus unavailable.
- The APIs and user interfaces that sysadmins use to create and manage a cloud deployment can be significantly different from those in use on-premise. There will normally be a learning curve for sysadmins in a cloud transition, or when switching cloud providers.
- EBS throughput optimized volumes cost $0.045 per GB/month against S3 standard at $0.023 per GB/month or less.
- EBS cold volumes cost $0.025 per GB/month against S3 standard infrequent access at $0.0125 per GB/month.
Kryder's Law
For the three decades leading up to 2010 the price per byte of disk storage dropped 30-40%/year, following Kryder's Law, the disk analog of Moore's Law. Over time this fed into drops in the price of cloud storage services. Industry roadmaps for disks in the years before 2010 expected that the then-current technology, Perpendicular Magnetic Recording (PMR), would be rapidly displaced by Heat-Assisted Magnetic Recording (HAMR).
The transition from PMR to HAMR turned out to be far harder and more expensive than the industry expected, leading to an abrupt halt to disk price drops in 2010. In 2011 40% of the industry's capacity was destroyed in the Thai floods, and prices increased. Small quantities of HAMR drives are only now entering the market. Since the recovery from the Thai floods the Kryder rate has been closer to 10% than to the industry's perennially optimistic forecasts. For details, see The Medium-Term Prospects for Long-Term Storage Systems.
Cloud storage is, except for specialized archival services using tape and premium high-performance services using SSDs, implemented using "nearline disk". There is no real prospect of a return to 30-40% Kryder rates for disk, nor that SSDs will displace disk for bulk data. Thus planning for long-term storage, whether in the cloud or on-premise, should assume Kryder rates of 10-20%/year (to be conservative, at the low end of the range) and the corresponding rates for cloud storage.
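To see why the assumed Kryder rate matters so much for long-term planning, consider a deliberately crude sketch: the cost of storing a fixed collection for many years if the annual cost falls by a constant fraction each year. This is an illustration under stated assumptions, not a real economic model; it ignores interest rates, media replacement cycles, growth of the collection, and everything other than the raw price decline. The function name and the 30-year horizon are choices made for the example.

```python
def cumulative_cost(first_year_cost: float, kryder_rate: float, years: int) -> float:
    """Total cost of keeping a fixed collection when the yearly cost drops by kryder_rate."""
    return sum(first_year_cost * (1 - kryder_rate) ** t for t in range(years))


# Total 30-year cost, in units of the first year's cost, at several Kryder rates.
for rate in (0.35, 0.20, 0.10):
    total = cumulative_cost(1.0, rate, years=30)
    print(f"Kryder rate {rate:.0%}: {total:.1f}x the first year's cost")
```

At 30-40% rates the total is dominated by the first few years, so a long-term commitment costs little more than a short one. At 10% the later years remain a large part of the total, which is why planning on optimistic rates understates the long-term cost.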
|Byron et al, Fig 2|
Developments in storage technology affect the long-term total cost of ownership for archival systems. We have designed a simulator for archival storage that compares the relative cost of different technologies in an archival system. We found that the growth rates of performance and capacity for different storage technologies predict the cost of using them in archival systems. Hard disks, which require more electricity than other storage devices, offer a competitive solution to tape archival systems, particularly if the archived data must be accessed frequently. Solid state drives, which are more expensive for archival storage than tape or hard disk in terms of capital cost, require less power while offering more throughput than other storage technologies. We observed that the slow pace of development for optical disc technology will cause disc-based archives to become more expensive than other technologies; however, optical disc will remain a viable archival technology if its capacity and throughput increase more rapidly than they have in the past. We observed that the long-term prospect for development varies for different types of technology. Hard disks will likely remain competitive with tape for archival storage systems for years to come notwithstanding the prospect that hard disk capacity will increase more slowly than it has in the past.
This is truly impressive work that deserves detailed study.
On-premise vs. Cloud Economics
Clearly, the major economic driver of cloud economics is economies of scale. The four major cloud vendors build data centers, buy hardware and communications in huge volumes, and can develop and deploy high levels of automation to maximize the productivity of their staff. Nevertheless, the advantage a major vendor has over a smaller competitor decreases as the size differential between them decreases. On the other hand, smaller, storage-only services have an advantage in their singular focus on storage, where storage is only one of many products for the major vendors.
Storage-only services such as Wasabi and Backblaze appear to cost around $5/TB/month against around $20/TB/month for the major vendors, but 4x isn't an apples-to-apples comparison. The storage-only services don't provide triple geo-redundancy. A more realistic comparison is S3's 1-zone Infrequent Access product at $10/TB/month, or 2x, and it is a better product since it is integrated with a compute platform.
Amazon's margins on AWS overall are around 25%, and on storage probably higher. Backblaze and Wasabi almost certainly run on much lower margins, say 5%. So the cost comparison is probably around 1.5x. Backblaze stores around 750PB of data in rented data center space, so economies of scale are not a big factor at the Exabyte scale.
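The "around 1.5x" figure can be reproduced with a short calculation. The sketch assumes that "margin" means margin as a fraction of revenue, so that a provider's underlying cost is its price times (1 - margin); the prices and margins are the rough estimates given above, and the function name is illustrative.

```python
def provider_cost(price_per_tb_month: float, margin: float) -> float:
    """Underlying cost per TB/month, treating margin as a fraction of the price."""
    return price_per_tb_month * (1 - margin)


s3_one_zone_ia = provider_cost(10.0, margin=0.25)  # $10/TB/month at ~25% margin
storage_only = provider_cost(5.0, margin=0.05)     # $5/TB/month at ~5% margin
ratio = s3_one_zone_ia / storage_only
print(f"${s3_one_zone_ia:.2f} vs ${storage_only:.2f} per TB/month, ratio {ratio:.2f}x")
```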
The Internet Archive's two copies of over 30PB of data are examples of extremely economical on-premise storage. They use custom-built servers and software, not in data center space, and provide a significantly lower level of reliability. Nevertheless, they provide relatively intensive access to both copies, sustaining about 40Gb/s outbound traffic 24/7 as one of the top 300 Web sites in the world.
It is very unlikely that more conservative institutions could approach the Internet Archive's cost per TB/month. They spend about $2.4M/yr on hardware, so storage and compute combined cost them around $3.33/TB/month. That is around 30% less than Backblaze's storage-only service, which also uses custom hardware and software, but in rented data center space, which likely accounts for much of the cost differential.
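The $3.33/TB/month figure follows from the numbers above if the hardware spend is spread over both copies. The sketch below assumes exactly 30PB per copy (the text says "over 30PB"), decimal units, and the $5/TB/month storage-only price quoted earlier; the function name is made up for the example.

```python
def cost_per_tb_month(annual_spend: float, petabytes_per_copy: float, copies: int) -> float:
    """Annual spend spread across every stored terabyte, per month."""
    total_tb = petabytes_per_copy * 1000 * copies
    return annual_spend / 12 / total_tb


archive_cost = cost_per_tb_month(2_400_000, petabytes_per_copy=30, copies=2)
saving = 1 - archive_cost / 5.0  # relative to a $5/TB/month storage-only service
print(f"${archive_cost:.2f}/TB/month, {saving:.0%} below $5/TB/month")
```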
My guess is that more conservative institutions would need to operate at the 100PB scale in on-premise data center space before they could compete on raw cost with the storage-only services for preservation on nearline disk. The advantages of on-demand scaling are so large that institutions lacking the Internet Archive's audience cannot compete with the major cloud platforms for access, even ignoring the demands of "big data" access.
However, if we assume that access is provided from a major cloud platform, it isn't necessary for on-premise preservation storage to be on nearline disk. Facebook used two different technologies to implement their cold storage, which addresses similar requirements to preservation storage. I discussed the details in More on Facebook's Cold Storage; in summary the two technologies are mostly-powered-down nearline disk, and Blu-Ray optical disk robots. At Facebook's scale the cost savings of both are very significant, not because of media costs but rather the synergistic effects. Both can be housed in cheap warehouse space without air-conditioning, redundant power, or raised floors. The Blu-Ray system is the more interesting, and Panasonic has a version in production. Depending upon how many of the synergies can be harvested at institutional scale, it is quite possible that Panasonic's optical media technology could be a significantly cheaper on-premise preservation storage solution than cloud providers' services.
Jurisdiction and Liability
Twitter and newspapers are buzzing with complaints about widespread problems with access to government information and data (see for example, Wall Street Journal (paywall), ZDNet News, Pew Center, Washington Post, Scientific American, TheVerge, and FedScoop to name but a few).
Matthew Green, a professor at Johns Hopkins, said "It's worrying that every single US cryptography standard is now unavailable to practitioners." He was responding to the fact that he could not get the documents he needed from the National Institute of Standards and Technology (NIST) or its branch, the Computer Security Resource Center (CSRC). The government shutdown is the direct cause of these problems.
Maybe when/if the government opens again, we should scrape the NIST and CSRC websites, put all those publications somewhere public. It's worrying that *every single US cryptography standard* is now unavailable to practitioners. - Matthew Green (@matthew_d_green) January 12, 2019
They point out how this illustrates the importance of libraries collecting and preserving web-published information:
Regardless of who you (or your user communities) blame for the shutdown itself, this loss of access was entirely foreseeable and avoidable. It was foreseeable because it has happened before. It was avoidable because libraries can select, acquire, organize, and preserve these documents and provide access to them and services for them whether the government is open or shut-down.
But it also points out the power that even the few governments with a constitutional mandate for freedom of the press have over online content. Both the EU, with the "right to be forgotten" and the GDPR, and the US in the long-running dispute between the Dept. of Justice and Microsoft over e-mails on a server in Ireland, claim extra-territorial jurisdiction in cyberspace. As I wrote back in 2015:
the recent legal battle between the US Dept. of Justice and Microsoft over access to e-mails stored on a server in Ireland has made it clear that the US is determined to establish that content under the control of a company with a substantial connection to the USA be subject to US jurisdiction irrespective of where the content is stored. The EU has also passed laws claiming extra-territorial jurisdiction over data, so is in a poor position to object to US claims.
The Microsoft case was eventually declared moot by the Supreme Court after passage of the CLOUD act, which:
asserts that U.S. data and communication companies must provide stored data for U.S. citizens on any server they own and operate when requested by warrant, but provides mechanisms for the companies or the courts to reject or challenge these if they believe the request violates the privacy rights of the foreign country the data is stored in.
Although the CLOUD act has some limitations, the precedent that the Dept. of Justice was seeking was in effect established.
Outsourcing to the cloud in a world of extra-territorial jurisdiction involves two kinds of risks:
- The physical location of the servers on which the data is stored, as in the Microsoft case, becomes an issue. In some cases there are legal requirements in the cloud customer's jurisdiction that data be kept only on servers within that jurisdiction. Major cloud services such as Amazon allow customers to specify such restrictions. Presumably the goal of the legal requirements is to establish that the customer's jurisdiction controls access to the data, but in an extra-territorial world they don't guarantee this.
- The venue for dispute resolution is also an issue. The cloud service's End User License Agreement is written by their lawyers to transfer all possible risk to the customer. Part of that typically involves specifying the jurisdiction under which disputes will be resolved (all major cloud vendors are US-based), and the process for resolving them. For example, the AWS Customer Agreement specifies US federal law governs, and enforces mandatory binding arbitration in the US (Section 13.5). This all means, in effect, that customers have no realistic recourse against bad behavior by their cloud vendor.
The most likely cause for a dispute with a cloud service vendor is when the vendor decides that the service upon which the customer depends is insufficiently profitable, and abandons it. Google in particular has a history of doing so, tracked by Le Monde's Google Memorial, le petit musée des projets Google abandonnés, the Google Graveyard, and the comments to this blog post.
The margins on AWS, averaging 24.75% over the last twelve quarters, are what enables Amazon to run the US retail business averaging under 3% margin and the international business averaging -3.7% margin over the same period.
It should be noted that part of the reason for AWS' profitability is that, like the other major cloud vendors, Amazon's tax-avoidance practices are extremely effective. Taxpayer funded institutions might run some risk to their reputation by becoming dependent upon a company that blatantly abuses the public purse in this way.
Arkivum
There may be at least one cloud service designed specifically for infrequently accessed archival data that does accept some liability for delivering what its customers pay for, a British company called Arkivum. I say "may be" because since I last looked at them their web site has been redesigned so as to break all deep links to it (a bad sign from an archiving company) and I cannot find any useful information on the new site. But I have third-party sources and the Wayback Machine, so here goes.
In 2016, with the traditional jokey headline, Chris Mellor at The Register reported on Arkivum:
It's competing successfully with public cloud and large-scale on-premises tape libraries by offering escrow-based guaranteed storage in its cloud.
It was started up in 2011 as a way of productising a service technology developed at the University of Southampton. This features data kept on tape in three remote locations, one of them under an Escrow arrangement, a service licensing agreement (SLA), a 25-year data integrity guarantee, and an on-premises appliance to take in large amounts of data and provide a local cache. ... The Escrow system is there to demonstrate no lock-in to Arkivum and the ability to get your data back should anything untoward happen.
Their 2016 General Terms and Conditions provide SLAs for availability and access latency. There don't appear to be egress charges, but egress is limited:
The Storage Services are intended for storing archive data, which is data that is not frequently accessed or updated. Frequent access to data stored via the Storage Services will be subject to Arkivum's fair usage policies, which may vary from time to time. The current policy permits retrieval of up to 5% of the Customer's data either by number of files or data volume per month.
The design of Arkivum's product is of interest because it specifically addresses a number of concerns with the use of cloud storage services for archival data:
- The escrow feature mitigates lock-in to some extent. Presumably the idea is that once the tapes have been retrieved from escrow, the data can be recovered from them. But as I understand it the data is encrypted and formatted by Arkivum's software, so this may be a non-trivial task.
- The data integrity guarantee accepts some level of liability for delivering the service customers pay for.
- The on-premise appliance provides a file-based rather than an object-based API for storing data to the service, and accessing data once it has been retrieved from the system. Access requires a request to the service, incurs a delay, and is subject to the "fair usage policies".
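To make the fair usage limit concrete: if the 5% monthly retrieval allowance also governed a complete exit through the normal access path, and stayed unchanged, retrieving a whole collection would take many months. The sketch below works through that arithmetic. The function name is illustrative, and whether the policy would actually apply to a bulk exit (as opposed to recovery from escrow) is an assumption, not something the quoted terms state.

```python
def months_to_retrieve_all(monthly_percent: int) -> int:
    """Whole months needed to retrieve 100% of a collection when at most
    `monthly_percent` percent of it may be retrieved each month.

    Assumption: the allowance is constant and is always used in full.
    """
    if not 0 < monthly_percent <= 100:
        raise ValueError("monthly_percent must be between 1 and 100")
    # Ceiling division: a partial final month still counts as a month.
    return -(-100 // monthly_percent)


print(months_to_retrieve_all(5))  # 20
```

At the quoted 5% allowance this comes to 20 months, which is why the escrow arrangement, rather than normal retrieval, is the relevant exit path.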
Appendix
The data for the lock-in tables comes from the services' published prices, as linked from the first column. Assumptions:
- Selected products all claim at least 11 nines of "durability". But note that these claims are dubious, and:
only Backblaze reveals how they arrive at their number of nines, and that their methodology considers only hardware failures, and in fact only whole-drive failures.
- Geo-replication, except for storage-only services. This partly explains the cost differential.
- US East region (cheapest).
- Object (blob) storage not file storage.
- Data transfer consists of 100MB objects.
- No data transfer failures needing re-transmission.
- No access even for integrity checks.
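As a rough illustration of how a lock-in figure can be derived under assumptions like these, the sketch below expresses the cost of getting all the data out as a multiple of the monthly storage cost. Everything specific in it is an assumption of the sketch: the definition of lock-in as "egress cost divided by monthly storage cost", the decimal units (1 GB = 1000 MB), the function and parameter names, and above all the prices, which are hypothetical placeholders rather than any service's published rates.

```python
def lock_in_months(data_gb: float,
                   store_per_gb_month: float,
                   egress_per_gb: float,
                   get_per_1000_requests: float = 0.0,
                   object_mb: float = 100.0) -> float:
    """Cost of retrieving all data, expressed in months of storage charges.

    Mirrors the assumptions listed above: fixed-size objects, no
    re-transmission, and no access other than the final retrieval.
    """
    objects = data_gb * 1000 / object_mb            # number of 100MB objects
    request_cost = objects / 1000 * get_per_1000_requests
    egress_cost = data_gb * egress_per_gb + request_cost
    monthly_storage = data_gb * store_per_gb_month
    return egress_cost / monthly_storage


# Hypothetical prices for 1PB: $0.02/GB-month storage, $0.05/GB egress,
# $0.0004 per 1000 GET requests.
print(round(lock_in_months(1_000_000, 0.02, 0.05, 0.0004), 2))
```

With these placeholder numbers the egress bill is roughly two and a half months of storage charges. Because the objects are large, the per-request charges are negligible next to the per-GB egress charge.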
Since it is clear that data egress charges can have a significant impact on not-for-profit customers, Amazon has a program called AWS Global Data Egress Waiver:
AWS customers are eligible for waiver of egress charges under this program if they:
- Work in academic or research institutions.
- Run any research workloads or academic workloads. However, a few data-egress-as-a-service type applications are not allowed under this program, such as massively online open courseware (MOOC), media streaming services, and commercial, non-academic web hosting (web hosting that is part of the normal workings of a university is allowed, like departmental websites or enterprise workloads).
- Route at least 80% of their Data Egress out of the AWS Cloud through an approved National Research and Education (NREN) network, such as Internet2, ESnet, GEANT, Janet, SingAREN, SINET, AARNet, and CANARIE. Most research institutions use these government-funded, dedicated networks to connect to AWS, while realizing higher network performance, better bandwidth, and stability.
- Use institutional e-mail addresses for AWS accounts.
- Work in an approved AWS Region.
Note that:
The maximum discount is 15% of total monthly spending on AWS services, which is several times the usage we typically see among our research customers.
which means that while this program may have some effect on the budget impact of demand spikes, it has little if any effect on lock-in. University research libraries would presumably qualify for this program; it isn't clear that national libraries would. The Global Data Egress Waiver is not included in the numbers in the tables.
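The effect of the 15% cap can be seen with a little arithmetic. The sketch below assumes that "total monthly spending" includes the egress charge itself and that the waiver is simply the smaller of the egress charge and 15% of the total; the quoted text does not spell out either detail. The dollar figures and the function name are hypothetical.

```python
def egress_after_waiver(egress_charge: float,
                        other_charges: float,
                        cap_fraction: float = 0.15) -> float:
    """Egress charge still owed after a waiver capped at `cap_fraction`
    of the month's total bill (assumed to include egress)."""
    total = egress_charge + other_charges
    waiver = min(egress_charge, cap_fraction * total)
    return egress_charge - waiver


# Ordinary month: modest egress is fully covered by the waiver.
print(egress_after_waiver(egress_charge=1_000, other_charges=20_000))

# Exit month: retrieving everything dwarfs the cap, so most of the
# egress charge is still owed.
print(egress_after_waiver(egress_charge=50_000, other_charges=20_000))
```

In the ordinary month nothing is owed for egress. In the exit month the cap is $10,500 against a $50,000 egress charge, so roughly four-fifths of the cost of leaving remains.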