[Figure: Cost vs. Kryder rate]
What I wanted was a rough-and-ready Web page that would allow interested people to play "what if" games. What the students wanted was something academically respectable enough to get them credit. So the models accumulated lots of interesting details.
But the details weren't actually useful. The extra realism they provided was swamped by the uncertainty from the "known unknowns" of the future Kryder and interest rates. So I never got the rough-and-ready Web page. Below the fold, I bring the story up-to-date and point to a little Web site that may be useful.
Earlier this year the Internet Archive asked me to update the numbers we had been working with all those years ago. And, being retired with time on my hands (not!), I decided instead to start again. I built an extremely simple version of my original economic model, eliminating all the details that weren't relevant to the Internet Archive and everything else that was too complex to implement at short notice, and put it behind an equally simple Web site running on a Raspberry Pi (so please don't beat up on it).
What This Model Does
For a single Terabyte of data, the model computes the endowment: the sum of money which, deposited with the Terabyte and invested at interest, would suffice to pay for the storage of the data "for ever" (actually 100 years in this model).
Assumptions
These are the less-than-totally-realistic assumptions underlying the model:
- Drive cost is constant, although each year the same cost buys drives with more capacity as given by the Kryder rate.
- The interest rate and the Kryder rate do not vary for the duration.
- The storage infrastructure consists of multiple racks, containing multiple slots for drives. I.e. the Terabyte occupies a very small fraction of the infrastructure.
- The number of drive slots per rack is constant.
- Ingesting the Terabyte into the infrastructure incurs no cost.
- The failure rate of drives is constant and known in advance, so that exactly the right number of spare drives is included in each purchase to ensure that every failed drive can be replaced by an identical drive.
- Drives are replaced at the end of their specified life even if they are still working.
Parameters
This model's adjustable parameters are as follows.
Media Cost Factors
- DriveCost: the initial cost per drive, assumed constant in real dollars.
- DriveTeraByte: the initial number of TB of useful data per drive (i.e. excluding overhead).
- KryderRate: the annual percentage by which DriveTeraByte increases.
- DriveLife: working drives are replaced after this many years.
- DriveFailRate: percentage of drives that fail each year.
Infrastructure Cost Factors
- SlotCost: the initial non-media cost of a rack (servers, networking, etc) divided by the number of drive slots.
- SlotRate: the annual percentage by which SlotCost decreases in real terms.
- SlotLife: racks are replaced after this many years.
Running Cost Factors
- SlotCostPerYear: the initial running cost per year (labor, power, etc) divided by the number of drive slots.
- LaborPowerRate: the annual percentage by which SlotCostPerYear increases in real terms.
- ReplicationFactor: the number of copies. This need not be an integer, to account for erasure coding.
- DiscountRate: the annual real interest obtained by investing the remaining endowment.
Defaults
The defaults are my invention for a rack full of 8TB drives. They should not be construed as representing the reality of your storage infrastructure. If you want to use the output of this model, for example for budgeting purposes, you need to determine your own values for the various parameters.
| Parameter | Default | Units |
|---|---|---|
| DriveTeraByte | 7.2 | Usable TB per drive |
| KryderRate | 10 | % per year |
| DriveFailRate | 2 | % per year |
| SlotRate | 0 | % per year |
| SlotCostPerYear | 100.00 | Initial $ per year |
| LaborPowerRate | 4 | % per year |
| DiscountRate | 2 | % per year |
| ReplicationFactor | 2 | # of copies |
Unlike the KryderRate and the SlotRate, which model things getting cheaper, the LaborPowerRate reflects the fact that the real cost of staff increases over time. Of course, the capacity of each slot is typically increasing faster than the LaborPowerRate, so the per-Terabyte running cost still decreases over time. Nevertheless, the endowment calculated is quite sensitive to the value of the LaborPowerRate.
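To make the interplay between the two rates concrete, here is a minimal Python sketch (not the code behind the Web site). It assumes, for simplicity, that a slot always holds a drive of the current year's capacity, ignoring the fact that drives are only replaced every DriveLife years, and that both rates compound annually. The function name `running_cost_per_tb` is made up for this illustration; the numbers are the defaults from the table above.

```python
def running_cost_per_tb(year, slot_cost_per_year=100.0, labor_power_rate=4.0,
                        drive_terabyte=7.2, kryder_rate=10.0):
    """Running cost per usable TB per year, `year` years from the start.

    ASSUMPTION: the slot always holds a drive with that year's capacity,
    and both rates compound annually.
    """
    # Running cost of one slot grows with the LaborPowerRate.
    slot_cost = slot_cost_per_year * (1 + labor_power_rate / 100.0) ** year
    # Capacity of the drive in the slot grows with the KryderRate.
    slot_capacity = drive_terabyte * (1 + kryder_rate / 100.0) ** year
    return slot_cost / slot_capacity


for year in (0, 10, 20, 30):
    print(f"year {year:2d}: ${running_cost_per_tb(year):.2f} per TB per year")
```

With the defaults the cost per slot rises every year but the cost per Terabyte falls, because 10% capacity growth outpaces 4% cost growth. Setting the two rates equal makes the per-Terabyte cost flat.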
Calculation
The model works through the 100-year duration year by year. Each year it figures out the payments needed to keep the Terabyte stored, including running costs and equipment purchases. It then uses the DiscountRate to figure out how much would have to have been invested at the start to supply that amount at that time. In other words, it computes the Net Present Value of each year's expenditure and sums them to compute the endowment needed to pay for storage over the full duration.
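The year-by-year calculation can be sketched in a few lines of Python. This is an illustration of the idea under stated assumptions, not the code behind the Web site. In particular, the values for DriveCost, DriveLife, SlotCost and SlotLife are hypothetical placeholders, the spare-drive allowance is approximated as DriveFailRate times DriveLife, payments are assumed to fall at the start of each year, and rack cost is charged at each rack replacement for the slots then in use.

```python
def endowment_per_tb(
    drive_cost=250.0,          # HYPOTHETICAL placeholder, $ per drive
    drive_terabyte=7.2,
    kryder_rate=10.0,
    drive_life=4,              # HYPOTHETICAL placeholder, years
    drive_fail_rate=2.0,
    slot_cost=150.0,           # HYPOTHETICAL placeholder, $ per slot
    slot_rate=0.0,
    slot_life=8,               # HYPOTHETICAL placeholder, years
    slot_cost_per_year=100.0,
    labor_power_rate=4.0,
    replication_factor=2.0,
    discount_rate=2.0,
    years=100,
):
    """Sum of the Net Present Values of each year's spending on one TB."""
    total = 0.0
    slots = 0.0  # fraction of a drive slot occupied by all copies of the TB
    for year in range(years):
        spend = 0.0
        if year % drive_life == 0:
            # Buy new drives; capacity per drive has grown at the KryderRate.
            capacity = drive_terabyte * (1 + kryder_rate / 100.0) ** year
            slots = replication_factor / capacity
            # ASSUMPTION: spares bought up front to cover expected failures.
            spares = 1 + drive_fail_rate / 100.0 * drive_life
            spend += slots * drive_cost * spares
        if year % slot_life == 0:
            # Replace the rack; SlotCost has fallen at the SlotRate.
            spend += slots * slot_cost * (1 - slot_rate / 100.0) ** year
        # Running costs grow at the LaborPowerRate.
        spend += slots * slot_cost_per_year * (1 + labor_power_rate / 100.0) ** year
        # Discount this year's spending back to the start.
        total += spend / (1 + discount_rate / 100.0) ** year
    return total


print(f"Endowment per TB: ${endowment_per_tb():,.2f}")
```

Because several inputs here are placeholders, the printed figure is only useful for seeing how the result moves when a parameter changes, for example `endowment_per_tb(kryder_rate=20.0)` versus the default.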
[Figure: Sample model output]
- Provide a set of parameters including a DiscountRate and a KryderRate, and compute the model's estimate of the endowment.
- Provide a set of parameters excluding the DiscountRate and the KryderRate, and draw a graph of how the model's estimate of the endowment varies with the DiscountRate and KryderRate for reasonable ranges of these two parameters.
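The second mode, sweeping the DiscountRate and the KryderRate while holding everything else fixed, might be sketched as below. This is an illustration only: `stand_in_model` is a hypothetical, deliberately crude model that counts just the discounted running costs of two copies using the default values, and the rate ranges are arbitrary examples. Any function taking a discount rate and a Kryder rate could be passed to `sweep` instead. The output is a text table rather than a graph to keep the example free of plotting libraries.

```python
def stand_in_model(discount_rate, kryder_rate, years=100):
    """HYPOTHETICAL stand-in: discounted running cost only, two copies."""
    total = 0.0
    for year in range(years):
        # $100/slot/year growing at 4%, spread over 7.2 TB growing at kryder_rate.
        capacity = 7.2 * (1 + kryder_rate / 100.0) ** year
        cost = 2 * 100.0 * 1.04 ** year / capacity
        total += cost / (1 + discount_rate / 100.0) ** year
    return total


def sweep(model, discount_rates, kryder_rates):
    """Evaluate model(discount_rate, kryder_rate) over a grid of both rates."""
    return {(d, k): model(d, k) for d in discount_rates for k in kryder_rates}


discount_rates = [0, 1, 2, 3, 4]   # example range, % per year
kryder_rates = [5, 10, 15, 20]     # example range, % per year
grid = sweep(stand_in_model, discount_rates, kryder_rates)

# Rows are DiscountRate values, columns are KryderRate values.
print("disc\\kryder" + "".join(f"{k:>10}%" for k in kryder_rates))
for d in discount_rates:
    print(f"{d:>10}%" + "".join(f"{grid[(d, k)]:>11.2f}" for k in kryder_rates))
```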
Code
The code is here under an Apache 2.0 license.
What This Model Doesn't (Yet) Do
If I can find the time, some of these deficiencies in the model may be removed:
- Unlike earlier published research, this model ignores the cost of ingesting the data in the first place, and accessing it later. Experience suggests the following rule of thumb: ingest is half the total lifetime cost, storage is one-third the total lifetime cost, and access is one-sixth. Thus a reasonable estimate of the total preservation cost of a Terabyte is three times the result of this model.
- The model assumes that the parameters are constant through time. Historically, interest rates, the Kryder rate, labor costs, etc. have varied, and thus should be modeled using Monte Carlo techniques and a probability distribution for each such parameter. It is possible for real interest rates to go negative, disk cost per Terabyte to spike upwards, as it did after the Thai floods, and so on. These low-probability events can have a large effect on the endowment needed, but are excluded from this model. Fixing this needs more CPU power than a Raspberry Pi.
- There are a number of different possible policies for handling the inevitable drive failures, and different ways to model each of them. This model assumes that it is possible to predict at the time a batch of drives is purchased what proportion of them will fail, and inflates the purchase cost by that factor. This models the policy of buying extra drives so that failures can be replaced by the same drive model.
- The model assumes that drives are replaced after DriveLife years even though they are working. Continuing to use the drives beyond this can have significant effects on the endowment (see this paper).
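As an illustration of the Monte Carlo approach mentioned in the list above, here is a minimal Python sketch in which only the interest rate varies from year to year. Everything specific in it is an assumption made for the example: the rate is drawn independently each year from a normal distribution with a mean of 2% and a standard deviation of 1.5%, the yearly costs are a flat hypothetical stream, and the function names are invented. A serious version would also vary the Kryder rate and labor costs, and would want distributions fitted to historical data.

```python
import random


def endowment_with_varying_rates(yearly_costs, yearly_rates):
    """Discount each year's cost by the compounded rates of the years before it."""
    total = 0.0
    growth = 1.0  # value, at the start of this year, of $1 invested at year 0
    for cost, rate in zip(yearly_costs, yearly_rates):
        total += cost / growth
        growth *= 1 + rate / 100.0
    return total


def simulate(yearly_costs, mean_rate=2.0, sd=1.5, trials=1000, seed=0):
    """Return the sorted endowments from `trials` random interest-rate paths."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    results = []
    for _ in range(trials):
        # ASSUMPTION: independent, normally distributed real rates (may be negative).
        rates = [rng.gauss(mean_rate, sd) for _ in yearly_costs]
        results.append(endowment_with_varying_rates(yearly_costs, rates))
    return sorted(results)


yearly_costs = [10.0] * 100  # HYPOTHETICAL flat cost stream, $ per year
results = simulate(yearly_costs)
print(f"median endowment:  ${results[len(results) // 2]:.2f}")
print(f"95th percentile:   ${results[int(len(results) * 0.95)]:.2f}")
```

The gap between the median and the upper percentiles is the point of the exercise: it shows how much extra endowment is needed to be reasonably confident of not running out of money, which a single fixed-rate calculation cannot show.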