Tuesday, August 22, 2017

Economic Model of Long-Term Storage

[Figure: Cost vs. Kryder rate]
As I wrote last month in Patting Myself On The Back, I started working on economic models of long-term storage six years ago. I got a small amount of funding from the Library of Congress; when that ran out I transferred the work to students at UC Santa Cruz's Storage Systems Research Center. This work was published here in 2012 and in later papers (see here).

What I wanted was a rough-and-ready Web page that would allow interested people to play "what if" games. What the students wanted was something academically respectable enough to get them credit. So the models accumulated lots of interesting details.

But the details weren't actually useful. The extra realism they provided was swamped by the uncertainty from the "known unknowns" of the future Kryder and interest rates. So I never got the rough-and-ready Web page. Below the fold, I bring the story up-to-date and point to a little Web site that may be useful.

Earlier this year the Internet Archive asked me to update the numbers we had been working with all those years ago. And, being retired with time on my hands (not!), I decided instead to start again. I built an extremely simple version of my original economic model, eliminating all the details that weren't relevant to the Internet Archive and everything else that was too complex to implement at short notice, and put it behind an equally simple Web site running on a Raspberry Pi (so please don't beat up on it).

What This Model Does

For a single Terabyte of data, the model computes the endowment: the money which, deposited with the Terabyte and invested at interest, would suffice to pay for storing the data "for ever" (actually 100 years in this model).
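One way to write this as a formula (the notation and the start-of-year payment timing are assumptions for illustration, not taken from the model's code): let $C_t$ be the real-dollar cost of keeping the Terabyte stored during year $t$, and let $r$ be the annual real interest rate expressed as a fraction. The endowment $E$ is then roughly

$$
E = \sum_{t=0}^{99} \frac{C_t}{(1 + r)^t}
$$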

Assumptions

These are the less than totally realistic assumptions underlying the model:
  • Drive cost is constant, although each year the same cost buys drives with more capacity as given by the Kryder rate.
  • The interest rate and the Kryder rate do not vary for the duration.
  • The storage infrastructure consists of multiple racks, containing multiple slots for drives. I.e. the Terabyte occupies a very small fraction of the infrastructure.
  • The number of drive slots per rack is constant.
  • Ingesting the Terabyte into the infrastructure incurs no cost.
  • The failure rate of drives is constant and known in advance, so that each purchase includes exactly the right number of spare drives to replace every failed drive with an identical one.
  • Drives are replaced at the end of their specified life even if they are still working.
Some of these assumptions may get removed in the future (see below).

Parameters

This model's adjustable parameters are as follows.

Media Cost Factors

  • DriveCost: the initial cost per drive, assumed constant in real dollars.
  • DriveTeraByte: the initial number of TB of useful data per drive (i.e. excluding overhead).
  • KryderRate: the annual percentage by which DriveTeraByte increases.
  • DriveLife: working drives are replaced after this many years.
  • DriveFailRate: percentage of drives that fail each year.

Infrastructure Cost Factors

  • SlotCost: the initial non-media cost of a rack (servers, networking, etc.) divided by the number of drive slots.
  • SlotRate: the annual percentage by which SlotCost decreases in real terms.
  • SlotLife: racks are replaced after this many years.

Running Cost Factors

  • SlotCostPerYear: the initial running cost per year (labor, power, etc.) divided by the number of drive slots.
  • LaborPowerRate: the annual percentage by which SlotCostPerYear increases in real terms.
  • ReplicationFactor: the number of copies. This need not be an integer, to account for erasure coding.

Financial Factors

  • DiscountRate: the annual real interest obtained by investing the remaining endowment.

Defaults

The defaults are my invention for a rack full of 8TB drives. They should not be construed as representing the reality of your storage infrastructure. If you want to use the output of this model, for example for budgeting purposes, you need to determine your own values for the various parameters.

Default values
Parameter | Value | Units
DriveCost | 250.00 | Initial $
DriveTeraByte | 7.2 | Usable TB per drive
KryderRate | 10 | % per year
DriveLife | 4 | years
DriveFailRate | 2 | % per year
SlotCost | 150.00 | Initial $
SlotRate | 0 | % per year
SlotLife | 8 | years
SlotCostPerYear | 100.00 | Initial $ per year
LaborPowerRate | 4 | % per year
DiscountRate | 2 | % per year
ReplicationFactor | 2 | # of copies

Unlike the KryderRate and the SlotRate, the LaborPowerRate reflects that the real cost of staff increases over time. Of course, the capacity of the slots is typically increasing faster than the LaborPowerRate, so the per-Terabyte cost from the LaborPowerRate still decreases over time. Nevertheless, the endowment calculated is quite sensitive to the value of the LaborPowerRate.
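A rough sketch of that last point, using the default values. It assumes, for simplicity, that the slot always holds a drive bought in the year in question, which ignores DriveLife. The function name is hypothetical and not part of the model's code.

```python
def running_cost_per_tb(year, slot_cost_per_year=100.0, labor_power_rate=4.0,
                        drive_terabyte=7.2, kryder_rate=10.0):
    """Real running cost per usable TB for a slot holding a drive bought in `year`.

    Simplifying assumption: the slot's capacity tracks the Kryder rate every
    year, rather than stepping up only when drives are replaced.
    """
    # Running cost of the slot grows at the LaborPowerRate.
    cost = slot_cost_per_year * (1 + labor_power_rate / 100) ** year
    # Usable capacity in the slot grows at the KryderRate.
    capacity = drive_terabyte * (1 + kryder_rate / 100) ** year
    return cost / capacity


for year in (0, 10, 20):
    print(year, round(running_cost_per_tb(year), 2))
```

With the defaults the per-Terabyte figure falls each year because 1.04/1.10 is less than 1. Set labor_power_rate above kryder_rate and it rises instead.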

Calculation

The model works through the 100-year duration year by year. Each year it figures out the payments needed to keep the Terabyte stored, including running costs and equipment purchases. It then uses the DiscountRate to figure out how much would have to have been invested at the start to supply that amount at that time. In other words, it computes the Net Present Value of each year's expenditure and sums them to compute the endowment needed to pay for storage over the full duration.
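The loop described above can be sketched in a few lines of Python. This is an illustrative reconstruction from the prose, not the code from the repository. In particular, the spares formula, the way a slot's cost is shared out by the capacity of the drive currently installed, and the payment of each year's costs at the start of the year are all assumptions made here for simplicity.

```python
def endowment(drive_cost=250.0, drive_terabyte=7.2, kryder_rate=10.0,
              drive_life=4, drive_fail_rate=2.0,
              slot_cost=150.0, slot_rate=0.0, slot_life=8,
              slot_cost_per_year=100.0, labor_power_rate=4.0,
              discount_rate=2.0, replication_factor=2.0, years=100):
    """Sum the discounted yearly cost of keeping one TB stored (illustrative only)."""
    total = 0.0
    installed_tb = drive_terabyte  # usable TB of the drive generation now in the slot
    for year in range(years):
        cost = 0.0
        if year % drive_life == 0:
            # New drive generation: same price, more capacity (Kryder rate).
            installed_tb = drive_terabyte * (1 + kryder_rate / 100) ** year
            # Assumed spares policy: buy enough extra drives up front to
            # cover the expected failures over the drive's life.
            spares = 1 + (drive_fail_rate / 100) * drive_life
            cost += drive_cost * spares / installed_tb
        if year % slot_life == 0:
            # New rack: the Terabyte pays its share of one slot.
            cost += slot_cost * (1 - slot_rate / 100) ** year / installed_tb
        # Running cost (labor, power) for the Terabyte's share of the slot.
        cost += slot_cost_per_year * (1 + labor_power_rate / 100) ** year / installed_tb
        cost *= replication_factor
        # Net Present Value of this year's expenditure.
        total += cost / (1 + discount_rate / 100) ** year
    return total


print(round(endowment(), 2))
print(round(endowment(kryder_rate=20.0), 2))
```

Calling the function with different kryder_rate and discount_rate values is a quick way to play the "what if" games the model was built for.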

Usage

[Figure: Sample model output]
The Web site provides two ways to use the model:
The sample graph shows why adding lots of detail to the model isn't really useful: the effects of the unknowable future DiscountRate and KryderRate parameters are so large that they swamp any extra realism.

Code

The code is here under an Apache 2.0 license.

What This Model Doesn't (Yet) Do

If I can find the time, some of these deficiencies in the model may be removed:
  • Unlike earlier published research, this model ignores the cost of ingesting the data in the first place, and accessing it later. Experience suggests the following rule of thumb: ingest is half the total lifetime cost, storage is one-third the total lifetime cost, and access is one-sixth. Thus a reasonable estimate of the total preservation cost of a Terabyte is three times the result of this model.
  • The model assumes that the parameters are constant through time. Historically, interest rates, the Kryder rate, labor costs, etc. have varied, and thus should be modeled using Monte Carlo techniques and a probability distribution for each such parameter. It is possible for real interest rates to go negative, for disk cost per Terabyte to spike upwards as it did after the Thai floods, and so on. These low-probability events can have a large effect on the endowment needed, but are excluded from this model. Fixing this needs more CPU power than a Raspberry Pi can provide.
  • There are a number of different possible policies for handling the inevitable drive failures, and different ways to model each of them. This model assumes that it is possible to predict at the time a batch of drives is purchased what proportion of them will fail, and inflates the purchase cost by that factor. This models the policy of buying extra drives so that failures can be replaced by the same drive model.
  • The model assumes that drives are replaced after DriveLife years even though they are working. Continuing to use the drives beyond this can have significant effects on the endowment (see this paper).
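The rule of thumb in the first item above amounts to simple arithmetic. A minimal sketch, in which the function name and the sample figure of $1,000 are hypothetical:

```python
def lifetime_cost_breakdown(storage_endowment):
    """Split total lifetime cost using the 1/2 ingest, 1/3 storage, 1/6 access rule."""
    total = 3 * storage_endowment  # storage is one-third of the total
    return {
        "ingest": total / 2,
        "storage": storage_endowment,
        "access": total / 6,
        "total": total,
    }


print(lifetime_cost_breakdown(1000.0))
# → {'ingest': 1500.0, 'storage': 1000.0, 'access': 500.0, 'total': 3000.0}
```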

4 comments:

Rick Levine said...

Nice post, and model, even if it can't be predictive. Might want to throw in inflation. Effect might be large, given the decade average in the US hasn't been lower than 2% for the last 100 years. Revised code, assumed to be buggy, here.

Rick

David. said...

Rick, please read the post more carefully:

"constant in real dollars"

and:

"annual real interest"

The model works in real dollars, that is after adjusting for inflation. In other words, your idea of the average future rate of inflation needs to be subtracted from your idea of the KryderRate, SlotRate and LaborPowerRate in nominal dollars. Adding an inflation parameter would be double-counting.

Rick Levine said...

Oops. Apologies. I should've caught that by inference from your straw-man 2% discount rate, as well.

David. said...

I want to use the Pi for something else, so I have taken the model down.

If you need to use the model please install it on your own hardware from GitHub:

https://github.com/dshrosenthal/EconomicModel

If this isn't possible, post a comment and I'll see if I can resurrect the model.