Description
A minimal model of the endowment needed to store one terabyte on disk forever under a set of assumptions which can be adjusted.
It is an initial, greatly simplified version of earlier work by the LOCKSS Program and students at the Storage Systems Research Center of UC Santa Cruz.
This work was published here in 2012 and in later papers (see here).
The endowment is the money which, deposited with the data and invested at interest, suffices to pay for the storage of (in this case) a terabyte "forever", which in this model is 100 years.
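The definition above amounts to a present-value calculation, which can be sketched as follows. This is a minimal illustration, not the model's actual code: the function names `endowment` and `survives` are hypothetical, and it assumes each year's cost is paid at the start of the year, with interest then earned on what remains.

```python
def endowment(yearly_costs, discount_rate_pct):
    """Present value of a stream of real-dollar costs.

    Assumes cost number t is paid at the start of year t (t = 0, 1, ...)
    and that the remaining money earns discount_rate_pct percent per year.
    """
    rate = discount_rate_pct / 100.0
    return sum(cost / (1.0 + rate) ** year
               for year, cost in enumerate(yearly_costs))


def survives(initial, yearly_costs, discount_rate_pct, tolerance=1e-6):
    """Replay the endowment year by year: pay the bill, then earn interest."""
    rate = discount_rate_pct / 100.0
    balance = initial
    for cost in yearly_costs:
        balance -= cost
        if balance < -tolerance:
            return False          # ran out of money before the horizon
        balance *= 1.0 + rate
    return True


# Illustrative only: a flat 100 dollars per year for 100 years at 2% real interest.
costs = [100.0] * 100
needed = endowment(costs, 2.0)
print(round(needed, 2), survives(needed, costs, 2.0))
```

An endowment computed this way is exactly exhausted by the final payment; anything less runs out before the 100 years are up.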
Parameters
This model's parameters are as follows.
Media Cost Factors
- DriveCost
- The initial cost per drive, assumed constant in real dollars.
- DriveTeraByte
- The initial number of TB of useful data per drive (i.e. excluding overhead).
- KryderRate
- The annual percentage by which DriveTeraByte increases.
- DriveLife
- Working drives are replaced after this many years.
- DriveFailRate
- Percentage of drives that fail each year.
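As a rough illustration of how DriveCost, DriveTeraByte and KryderRate interact, the sketch below computes the media cost per terabyte of a drive bought in a given year. It assumes the Kryder rate compounds annually; the function name and the example figures are hypothetical, not values from the model.

```python
def media_cost_per_tb(year, drive_cost, drive_terabyte, kryder_rate_pct):
    """Cost per usable TB of a drive bought in `year`.

    The drive price stays constant in real dollars while capacity grows by
    kryder_rate_pct percent per year (compounding is an assumption here).
    """
    capacity_tb = drive_terabyte * (1.0 + kryder_rate_pct / 100.0) ** year
    return drive_cost / capacity_tb


# Illustrative figures: a 250 dollar, 10 TB drive and a 20% Kryder rate.
for year in (0, 4, 8):
    print(year, round(media_cost_per_tb(year, 250.0, 10.0, 20.0), 2))
```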
Infrastructure Cost Factors
- SlotCost
- The initial non-media cost of a rack (servers, networking, etc.) divided by the number of drive slots.
- SlotRate
- The annual percentage by which SlotCost decreases in real terms.
- SlotLife
- Racks are replaced after this many years.
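The infrastructure parameters can be read as a purchase schedule: a slot is bought in year 0 and again every SlotLife years, at a price that falls by SlotRate percent per year. The sketch below assumes compound decline and a purchase in year 0; the helper names and example figures are hypothetical.

```python
def slot_cost_in_year(year, slot_cost, slot_rate_pct):
    """Real-dollar price of one drive slot bought in `year`."""
    return slot_cost * (1.0 - slot_rate_pct / 100.0) ** year


def replacement_years(life, horizon=100):
    """Years in which equipment with the given life is bought, starting at year 0."""
    return list(range(0, horizon, life))


# Illustrative figures: a 150 dollar slot, 2% annual decline, 8-year rack life.
for year in replacement_years(8, horizon=24):
    print(year, round(slot_cost_in_year(year, 150.0, 2.0), 2))
```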
Running Cost Factors
- SlotCostPerYear
- The initial running cost per year (labor, power, etc.) divided by the number of drive slots.
- LaborPowerRate
- The annual percentage by which SlotCostPerYear increases in real terms.
- ReplicationFactor
- The number of copies. This need not be an integer, to account for erasure coding.
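One way to turn the per-slot running cost into a per-terabyte figure is sketched below. It assumes that a terabyte occupies `1 / tb_per_drive` of a slot and that every copy incurs the same running cost; both are simplifying assumptions, and the function name and example figures are hypothetical.

```python
def running_cost_per_tb(year, slot_cost_per_year, labor_power_rate_pct,
                        replication_factor, tb_per_drive):
    """Running cost in `year` of keeping one TB, across all copies."""
    per_slot = slot_cost_per_year * (1.0 + labor_power_rate_pct / 100.0) ** year
    # Each copy uses a 1/tb_per_drive share of one slot (assumption).
    return per_slot * replication_factor / tb_per_drive


# Illustrative figures: 100 dollars per slot per year, growing 3% a year,
# on 10 TB drives, with two full copies versus 1.5x erasure coding.
print(running_cost_per_tb(0, 100.0, 3.0, 2.0, 10.0))
print(running_cost_per_tb(0, 100.0, 3.0, 1.5, 10.0))
```

A non-integer ReplicationFactor such as 1.5 simply scales the cost, which is how the model can account for erasure coding.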
Financial Factors
- DiscountRate
- The annual real interest obtained by investing the remaining endowment.
Assumptions
- Unlike earlier published research, this model ignores the cost of ingesting the data in the first place, and accessing it later. Experience suggests the following rule of thumb: ingest is half the total lifetime cost, storage is one-third the total lifetime cost, and access is one-sixth. Thus a reasonable estimate of the total preservation cost of a terabyte is three times the result of this model.
- The model assumes that the parameters are constant through time. Historically, interest rates, the Kryder rate, labor costs, etc. have varied, and thus should be modelled using Monte Carlo techniques and a probability distribution for each such parameter. For example, real interest rates can go negative, and disk cost per terabyte can spike upwards, as it did after the 2011 Thai floods. These low-probability events can have a large effect on the endowment needed, but are excluded from this model.
- There are a number of different possible policies for handling the inevitable disk failures, and different ways to model each of them. This model assumes that it is possible to predict at the time a batch of disks is purchased what proportion of them will fail, and inflates the purchase cost by that factor. This models the policy of buying extra drives so that failures can be replaced by the same drive model.
- The model assumes that drives are replaced after DriveLife years even though they are working. Continuing to use the drives beyond this can have significant effects on the endowment; see this paper.
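Taken together, the parameters and assumptions above can be sketched as a small program. This is one possible reading of the model, not its actual implementation. All default values are illustrative placeholders. The spares calculation (buying up front the share of a batch expected to fail within DriveLife years), the charging of slot cost by the share of a slot used at purchase time, and the start-of-year payment timing are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Params:
    # Every default below is an illustrative placeholder, not a model value.
    drive_cost: float = 250.0          # DriveCost, real dollars per drive
    drive_terabyte: float = 10.0       # DriveTeraByte, usable TB at year 0
    kryder_rate: float = 10.0          # KryderRate, percent per year
    drive_life: int = 4                # DriveLife, years
    drive_fail_rate: float = 2.0       # DriveFailRate, percent per year
    slot_cost: float = 150.0           # SlotCost, real dollars at year 0
    slot_rate: float = 2.0             # SlotRate, percent per year
    slot_life: int = 8                 # SlotLife, years
    slot_cost_per_year: float = 100.0  # SlotCostPerYear at year 0
    labor_power_rate: float = 1.0      # LaborPowerRate, percent per year
    replication_factor: float = 2.0    # ReplicationFactor
    discount_rate: float = 2.0         # DiscountRate, percent per year
    years: int = 100                   # "forever" in this model


def yearly_costs(p):
    """Real-dollar cost of keeping one terabyte in each year of the horizon."""
    # Assumed failure policy: the share of a batch expected to fail within
    # DriveLife years is bought up front as spares of the same drive model.
    fail_share = 1.0 - (1.0 - p.drive_fail_rate / 100.0) ** p.drive_life
    costs = []
    tb_per_drive = p.drive_terabyte
    for year in range(p.years):
        cost = 0.0
        if year % p.drive_life == 0:   # working drives are replaced on schedule
            tb_per_drive = p.drive_terabyte * (1.0 + p.kryder_rate / 100.0) ** year
            cost += p.drive_cost * (1.0 + fail_share) / tb_per_drive
        if year % p.slot_life == 0:    # racks are replaced on schedule
            cost += p.slot_cost * (1.0 - p.slot_rate / 100.0) ** year / tb_per_drive
        cost += (p.slot_cost_per_year
                 * (1.0 + p.labor_power_rate / 100.0) ** year / tb_per_drive)
        costs.append(cost * p.replication_factor)
    return costs


def endowment(p):
    """Present value of the yearly costs, each paid at the start of its year."""
    rate = p.discount_rate / 100.0
    return sum(c / (1.0 + rate) ** y for y, c in enumerate(yearly_costs(p)))


def total_preservation_cost(p):
    """Rule of thumb: storage is one-third of the total lifetime cost."""
    return 3.0 * endowment(p)


print(round(endowment(Params()), 2))
print(round(endowment(Params(kryder_rate=30.0)), 2))
```

Varying one parameter at a time, as in the last two lines, shows how sensitive the endowment is to it. Replacing the constants with draws from probability distributions would be the starting point for the Monte Carlo treatment that this model omits.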