Two recent developments provide alternative models:
- Last year, James Byron, Darrell Long, and Ethan Miller's Using Simulation to Design Scalable and Cost-Efficient Archival Storage Systems (also here) reported on a vastly more sophisticated model developed at the Center. It both includes much more detailed historical data about, for example, electricity costs, and covers multiple media types including tape, optical, and SSDs.
- At the recent PASIG Julian Morley reported on the model being used at the Stanford Digital Repository, a hybrid local and cloud system, and he has made the spreadsheet available for use.
|Table 1: ARCHIVE PARAMETERS (Data Read, Modified, and Scrubbed Yearly)
|Table 7: HDD capacity (modified), showing CAGR since 2005
The capacity of new tapes, HDDs, and SSDs grows by more than 30% each year in these simulations, and therefore the required number of new storage devices each year decreases as the rate of growth for capacity outpaces the growth of the data in the archive.
But this is misleading: at least for hard disk, the Kryder rate has slowed dramatically. Their Table 7 disguises this in three ways: by showing the CAGR of capacity since 2005, well before the slowing started; by using the announcement date rather than the volume-ship date; and by omitting rows for 2013 and 2015, during the slowdown. Note that their citation for 14TB in 2017 is an October announcement; by the end of 2018 Backblaze, a volume purchaser, was just starting to deploy 14TB drives. I have taken the liberty of inserting in italics the capacities for the missing years assuming that the slowdown had not happened. The dramatic effect can be seen from the fact that if the earlier Kryder rate had continued, 32TB drives would be shipping this year. Instead we have only just reached volume shipping of 14TB drives.
Assuming that hard disk Kryder rates will accelerate back to pre-2010 rates is implausible. Their Table 9 asserts that high-capacity hard disk will reach a limit of 100TB/drive in 7 years, 18 years before the end of their simulation. That implies roughly a 32% Kryder rate, but it is based on industry "roadmaps", which have a long history of Panglossian optimism.
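The arithmetic behind that 32% figure is easy to check. The function below is just the compound-growth formula applied to drive capacity; it is my own illustration, not anything from the paper's simulator:

```python
import math

def years_to_capacity(start_tb, target_tb, kryder_rate):
    """Years for drive capacity to grow from start_tb to target_tb
    at a constant annual Kryder rate (fractional, e.g. 0.32 = 32%/yr)."""
    return math.log(target_tb / start_tb) / math.log(1 + kryder_rate)

# Table 9's limit of 100TB/drive in 7 years, starting from today's
# 14TB drives, implies roughly a 32% Kryder rate:
print(round(years_to_capacity(14, 100, 0.32), 1))  # ≈ 7.1 years
```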
|Byron et al Figure 12
Stanford
Via e-mail, Julian Morley provided me with a lot of background information to assist in interpreting the spreadsheet. I've tried to incorporate much of it in this section.
Stanford's policy is to hold the primary copy used for both access and preservation locally, and to maintain copies in each of three cloud archive services. They perform integrity checks very frequently on the local copy, but do not perform them on the cloud copies. The reason is that access charges for each of the archive services make it cheaper to maintain three disparate archive copies than to perform regular integrity checks on a single archive copy. The bet is that loss or damage would be very unlikely to affect the same data in three different vendors' archive storage simultaneously. If damage occurs to the local copy it can be repaired from one of the archive copies, although they have yet to need to do so.
The spreadsheet's "cloud style" cost for the local copy each year ranges from $0.022 to $0.007/GB/month, averaging over the 8 years $0.0115/GB/month, which it says to "Compare to S3 Standard or Infrequent Access pricing." At Morley's scale S3 Standard is currently $0.021, and S3 IA is $0.0125, which makes the local copy look good. But this is the wrong comparison. S3 and S3 IA provide triple geo-redundancy, Morley's local copy does not. The correct comparison is with S3 One-zone IA, currently at $0.010/GB/month. The local copy is slightly more expensive in storage costs alone, but:
- The local copy does not incur per-request or egress charges. In practice, especially given Stanford's requirement for very frequent integrity checks, this makes it significantly cheaper.
- If the access and preservation copy were in S3 One-zone IA, economics would force access processing, integrity checking and collection management to be in AWS, to avoid egress charges. The system would be effectively locked-in to AWS, something Stanford is anxious to avoid. Amazon has a program called AWS Global Data Egress Waiver for which Stanford would qualify, but Stanford's frequent integrity checks would exceed its limits by a large factor.
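A back-of-the-envelope comparison illustrates the point. The storage prices are those quoted above; the egress price and the integrity-check frequency are illustrative assumptions of mine, not Stanford's actual figures:

```python
# Rough annual cost per GB of an access+preservation copy held locally
# vs in S3 One Zone-IA, when the whole copy must be read out several
# times a year for integrity checking.
LOCAL_PER_GB_MONTH = 0.0115    # spreadsheet's 8-year average "cloud style" cost
S3_1Z_IA_PER_GB_MONTH = 0.010  # S3 One Zone-IA storage price quoted above
EGRESS_PER_GB = 0.05           # assumed bulk egress price; varies with volume

def annual_cost_per_gb(storage_per_gb_month, checks_per_year, egress_per_gb=0.0):
    """Annual cost per GB: storage, plus reading the whole copy out
    checks_per_year times for integrity checks."""
    return 12 * storage_per_gb_month + checks_per_year * egress_per_gb

local = annual_cost_per_gb(LOCAL_PER_GB_MONTH, checks_per_year=4)  # no egress
cloud = annual_cost_per_gb(S3_1Z_IA_PER_GB_MONTH, 4, EGRESS_PER_GB)
print(f"local ${local:.3f}/GB/yr vs cloud ${cloud:.3f}/GB/yr")
```

Even at this assumed modest egress price, the integrity-check traffic swamps the small storage-price advantage of the cloud copy.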
|CLOUD SPEND PROPORTION
Mine
My model differs from the Santa Cruz and Stanford models in two important respects:
- The other models compute annual expenditures incurred by a growing collection; mine computes the cost to store a fixed-size collection.
- The other models compute expenditures in current dollars at the time they are spent; mine considers the time value of money to compute the net present value of the expenditures at the time of initial preservation.
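The net present value idea is just discounting: a dollar to be spent N years out needs only (1+r)^-N dollars of endowment today, where r is the real discount rate. A minimal sketch, with an assumed 2% real rate:

```python
def present_value(amount, years, discount_rate):
    """Net present value of an expenditure `years` in the future,
    given an annual real discount rate (fractional)."""
    return amount / (1 + discount_rate) ** years

# $1,000 of media replacement 8 years out, at an assumed 2% real rate,
# needs about $853 of endowment at the time of deposit:
print(round(present_value(1000, 8, 0.02), 2))  # ≈ 853.49
```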
Unlike access, preservation necessarily has a long time horizon. The Digital Preservation Network wanted its members to make a 20-year commitment, but they didn't make one. One of the important lessons it wants the community to learn from its failure is:
While research universities and cultural heritage institutions are innately long-running, they operate on that implicitly rather than by making explicit long-term plans.
|British Library Budget
My model takes a more business-like approach. A customer wants to deliver a collection to the repository for preservation. Typically in academia the reason is that a grant is ending and the data it generated needs a home for the long term, or that a faculty member has died and their legacy needs to be preserved. The repository needs to know how much to charge the customer. There is no future flow of funds from which annual payments can be made; the repository needs a one-time up-front payment to cover the cost of preservation "for ever". The data needs an endowment to pay for its preservation, a concept I first blogged about in 2007. In 2011 I wrote:
Serge Goldstein of Princeton calls it "Pay Once, Store Endlessly". ... Princeton is actually operating an endowed data service, so I will use Goldstein's PowerPoint from last fall's CNI meeting as an example of the typical simple analysis. He computes that if you replace the storage every 4 years and the price drops 20%/yr, you can keep the data forever for twice the initial storage cost.
The simple analysis ignores a lot of important factors, and Princeton was significantly under-pricing their service.
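Goldstein's arithmetic is a geometric series, and a two-line check shows where "twice the initial storage cost" comes from (strictly, it sums to about 1.7x, which he rounds up):

```python
# "Pay Once, Store Endlessly": replace media every 4 years while the
# price per TB drops 20%/yr.  Each replacement then costs 0.8**4 ≈ 0.41
# of the one before it, so the lifetime cost is a geometric series.
r = 0.8 ** 4              # cost ratio between successive replacements
total = 1 / (1 - r)       # 1 + r + r**2 + ... summed forever
print(round(total, 2))    # ≈ 1.69x the initial cost
```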
The parameters a user of my model can set are these:
As you can see, with the exception of the DiscountRate, they are similar but not identical to the parameters of the other models.
Media Cost Factors
- DriveCost: the initial cost per drive, assumed constant in real dollars.
- DriveTeraByte: the initial number of TB of useful data per drive (i.e. excluding overhead).
- KryderRate: the annual percentage by which DriveTeraByte increases.
- DriveLife: working drives are replaced after this many years.
- DriveFailRate: percentage of drives that fail each year.
Infrastructure Cost factors
- SlotCost: the initial non-media cost of a rack (servers, networking, etc) divided by the number of drive slots.
- SlotRate: the annual percentage by which SlotCost decreases in real terms.
- SlotLife: racks are replaced after this many years.
Running Cost Factors
- SlotCostPerYear: the initial running cost per year (labor, power, etc) divided by the number of drive slots.
- LaborPowerRate: the annual percentage by which SlotCostPerYear increases in real terms.
- ReplicationFactor: the number of copies. This need not be an integer, to account for erasure coding.
- DiscountRate: the annual real interest obtained by investing the remaining endowment.
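Putting the parameters together, a minimal sketch of the computation looks like the following. The structure mirrors the parameter list above, but the numeric defaults are illustrative, not calibrated, and the replacement scheduling is simplified; this is my own sketch, not the model's actual code:

```python
# Endowment for a fixed-size collection: simulate year-by-year spending
# on media, infrastructure, and running costs, and discount each year's
# spend back to the time of deposit.  Simplification: the drive count
# tracks current drive capacity each year, not capacity at purchase time.

def endowment(data_tb, years=100,
              drive_cost=250.0, drive_tb=14.0, kryder_rate=0.10,
              drive_life=5, drive_fail_rate=0.02,
              slot_cost=150.0, slot_rate=0.05, slot_life=8,
              slot_cost_per_year=50.0, labor_power_rate=0.02,
              replication_factor=2.5, discount_rate=0.02):
    """Net present value of `years` of expenditures to store a
    fixed-size collection of data_tb terabytes."""
    total = 0.0
    for year in range(years):
        tb_per_drive = drive_tb * (1 + kryder_rate) ** year
        drives = replication_factor * data_tb / tb_per_drive
        # running costs (labor, power) rise in real terms
        spend = drives * slot_cost_per_year * (1 + labor_power_rate) ** year
        # replace failed drives every year, the whole fleet every drive_life
        spend += drives * drive_cost * drive_fail_rate
        if year % drive_life == 0:
            spend += drives * drive_cost
        # slots get cheaper in real terms, replaced every slot_life years
        if year % slot_life == 0:
            spend += drives * slot_cost * (1 - slot_rate) ** year
        # discount this year's spend back to the time of deposit
        total += spend / (1 + discount_rate) ** year
    return total

print(f"Endowment for 1PB: ${endowment(1000):,.0f}")
```

Note how the DiscountRate enters only in the final discounting step; everything else is expressed in real (inflation-adjusted) dollars, which is why DriveCost can be held constant.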
Conclusion
Byron et al conclude:
Developments in storage technology affect the long-term total cost of ownership for archival systems. We have designed a simulator for archival storage that compares the relative cost of different technologies in an archival system. We found that the growth rates of performance and capacity for different storage technologies predict the cost of using them in archival systems. Hard disks, which require more electricity than other storage devices, offer a competitive solution to tape archival systems, particularly if the archived data must be accessed frequently. Solid state drives, which are more expensive for archival storage than tape or hard disk in terms of capital cost, require less power while offering more throughput than other storage technologies. We observed that the slow pace of development for optical disc technology will cause disc-based archives to become more expensive than other technologies; however, optical disc will remain a viable archival technology if its capacity and throughput increase more rapidly than they have in the past. We observed that the long-term prospect for development varies for different types of technology. Hard disks will likely remain competitive with tape for archival storage systems for years to come notwithstanding the prospect that hard disk capacity will increase more slowly than it has in the past.
I agree with all of this, except that I think they under-estimate the synergistic cost savings available from optical media technology if it can be deployed at Facebook scale.
Given that all current storage technologies are approaching physical limits in the foreseeable future, see Byron et al's Table 9, economic models should follow their example and include decreasing Kryder rates.