Saturday, September 20, 2014

Utah State Archives has a problem

A recent thread on the NDSA mailing list discussed the Utah State Archives' struggle with the cost of being forced to use Utah's state IT infrastructure for preservation. Below the fold, some quick comments.



Here's a summary of the situation the Archives finds itself in:
we actually have two separate copies of the AIP. One is on m-disc and the other is on spinning disk (a relatively inexpensive NAS device connected to our server, for which we pay our IT department each month). ... We have centralized IT, where there is one big data center and servers are virtualized. Our IT charges us a monthly rate for not just storage, but also all of their overhead to exist as a department. ... and we are required by statute to cooperate with IT in this model, so we can't just go out and buy/install whatever we want. For an archives, that's a problem, because our biggest need is storage but we are funded based upon the number of people we employ, not the quantity of data we need to store, and convincing the Legislature that we need $250,000/year for just one copy of 50 TB of data is a hard sell, never mind additional copies for SIP, AIP, and/or DIP.
Michelle Kimpton, who is in the business of persuading people that using DuraCloud is cheaper and better than doing it yourself, leaped at the opportunity this offered (my emphasis):
If I look at Utah State Archive storage cost, at $5,000 per year per TB vs. Amazon S3 at $370/year/TB it is such a big gap I have a hard time believing that Central IT organizations will be sustainable in the long run. Not that Amazon is the answer to everything, but they have certainly put a stake in the ground regarding what spinning disk costs, fully loaded (meaning this includes utilities, building and personnel). Amazon S3 also provides 3 copies, 2 onsite and one in another data center.

I am not advocating by any means that S3 is the answer to it all, but it is quite telling to compare the fully loaded TB cost from an internal IT shop vs. the fully loaded TB cost from Amazon.

I appreciate you sharing the numbers Elizabeth and it is great your IT group has calculated what I am guessing is the true cost for managing data locally.
Elizabeth Perkes responded for the Archives:
I think using Amazon costs more than just their fees, because someone locally still has to manage any server space you use in the cloud and make sure the infrastructure is updated. So then you either need to train your archives staff how to be a system administrator, or pay someone in the IT community an hourly rate to do that job. Depending on who you get, hourly rates can cost between $75-150/hour, and server administration is generally needed at least an hour per week, so the annual cost of that service is an additional $3,900-$7,800. Utah's IT rate is based on all costs to operate for all services, as I understand it. We have been using a special billing rate for our NAS device, which reflects more of the actual storage costs than the overhead, but then the auditors look at that and ask why that rate isn't available to everyone, so now IT is tempted to scale that back. I just looked at the standard published FY15 rates, and they have dropped from what they were a couple of years ago. The official storage rate is now .2386/GB/month, which is $143,160/year for 50 TB, or $2,863.20 per TB/year.
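Elizabeth's arithmetic checks out. Here is a quick back-of-envelope comparison of the rates quoted in the thread, as a sketch in Python (assuming decimal units, i.e. 1TB = 1,000GB):

    # Back-of-envelope check of the rates quoted above.
    # Assumes decimal units: 1 TB = 1,000 GB.
    archive_tb = 50
    archive_gb = archive_tb * 1000

    # Utah IT's published FY15 rate, $/GB/month.
    utah_rate = 0.2386
    utah_per_year = utah_rate * archive_gb * 12      # $143,160.00
    utah_per_tb_year = utah_per_year / archive_tb    # $2,863.20

    # The fully-loaded S3 figure Michelle quotes, $/TB/year.
    s3_per_tb_year = 370
    s3_per_year = s3_per_tb_year * archive_tb        # $18,500

    print(f"Utah IT: ${utah_per_year:,.2f}/year, ${utah_per_tb_year:,.2f}/TB/year")
    print(f"S3:      ${s3_per_year:,.2f}/year, ${s3_per_tb_year:,.2f}/TB/year")
    print(f"Ratio:   {utah_per_tb_year / s3_per_tb_year:.1f}x")

So even at the reduced FY15 rate, one copy in Utah IT's data center costs nearly eight times Amazon's fully-loaded list price, and roughly thirteen times at the $5,000/TB/year figure Michelle started from.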
But this doesn't get at the fundamental flaws in Michelle's marketing:
  • She suggests that Utah's IT charges reflect "the true cost for managing data locally". But that isn't what the Utah Archives are doing. They are buying IT services from a competitor to Amazon, one that they are required by statute to buy from. 
  • She compares Utah's IT with S3. S3 is a storage-only product. Using it cost-effectively, as Elizabeth points out, also involves buying Amazon's compute services (EC2), which is a separate business with its own P&L and pricing policies. For the Archives, Utah IT is in effect providing both the storage and the compute, so the comparison is misleading.
  • The comparison is misleading in another way. Long-term, reliable storage is not the business Utah IT is in. The Archives are buying storage services from a compute provider, not a storage provider. It isn't surprising that the pricing isn't competitive.
  • But more to the point, why would Utah IT bother to be competitive? Their customers can't go anywhere else, so they are bound to get gouged. I'm surprised that Utah IT is only charging about 10 times the going rate for an inferior storage product.
  • And don't fall for the idea that Utah IT is only charging what they need to cover their costs. They control the costs, and they have absolutely no incentive to minimize them. If an organization can hire more staff and pass the cost of doing so on to customers who are bound by statute to pay for them, it is going to hire a lot more staff than an organization whose customers can walk.
As I've pointed out before, Amazon's margins on S3 are enviable. You don't need to be very big to have economies of scale sufficient to undercut S3, as the numbers from Backblaze demonstrate. The Archives' 50TB is possibly not enough to do this even if they were actually managing the data locally.

But the Archives might well employ a strategy similar to the one I suggested for the Library of Congress Twitter collection. They already keep a copy on m-disc. Suppose they kept two copies on m-disc, as the Library keeps two copies on tape, and regarded that as their preservation solution. Then they could use Amazon's Reduced Redundancy Storage and AWS virtual servers as their access solution. Running frequent integrity checks might take an additional small AWS instance, and any damage detected could be repaired from one of the m-disc copies.
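As a rough sketch of what that integrity checking might look like (the bucket name and manifest format here are hypothetical, and I'm assuming the AWS SDK for Python), a small instance could periodically compare each object's S3 ETag against the checksum recorded when the AIP was written to m-disc:

    # A minimal fixity-check sketch, assuming the AWS SDK for Python (boto3).
    # The bucket name and manifest format are hypothetical: the manifest is
    # assumed to hold one "md5 key" pair per line, recorded when the AIPs
    # were written to m-disc.
    import boto3

    BUCKET = "utah-archives-access"     # hypothetical
    MANIFEST = "aip-manifest.txt"       # hypothetical

    def load_manifest(path):
        expected = {}
        with open(path) as f:
            for line in f:
                md5, key = line.strip().split(None, 1)
                expected[key] = md5
        return expected

    def find_damage(bucket, expected):
        s3 = boto3.client("s3")
        damaged = []
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                # For single-part uploads the ETag is the MD5 of the object;
                # multipart uploads would need the object re-hashed instead.
                etag = obj["ETag"].strip('"')
                md5 = expected.get(obj["Key"])
                if md5 is not None and md5 != etag:
                    damaged.append(obj["Key"])
        return damaged

    if __name__ == "__main__":
        for key in find_damage(BUCKET, load_manifest(MANIFEST)):
            print("Repair from m-disc:", key)

Anything flagged gets re-uploaded from one of the m-disc copies. Since the S3 copy is purely an access copy, losing an object there is an inconvenience to be repaired, not a preservation failure.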

Using the cloud for preservation is almost always a bad idea. Preservation is a base-load activity whereas the cloud is priced as a peak-load product. But the spiky nature of current access to archival collections is ideal for the cloud.

2 comments:

David. said...

Henry Newman has a post commenting on this.

Brad Jensen said...

Base-load versus peak-load doesn't make much difference when Amazon has beaten down the price so far. The bigger issues I see here are data access time and redundancy.

Does the central IT department keep all of the data in at least two widely separate (hundreds of miles) locations and keep the copy updated in near realtime?

Multiple local copies of an archive are better than nothing, but not much better.

On the other hand, cloud storage is slow storage, which shows up when you are moving lots of large files around. Scanned images and picture-based PDFs come to mind.

Are you sending your archive to the cloud to leave it there just in case something horrible happens to your local storage, or are you actually trying to replace the local NAS and access everything from the cloud in real time?

How big is your internet pipeline, and how much of it can you dedicate to this cloud access?

On the other hand, your central IT department seems to have their pricing wrong if they are charging your data-centric operation the same storage pricing that they are charging their service-centric internal customers.

If the IT department is only keeping one copy of the data, you might consider getting another NAS to use locally. You can link two Synology 12-bay NAS devices together as one unit.

Two of them would cost under $4,000, and 24 6 TB HGST NAS drives are under $7,200. That would give you 144 TB of raw storage, which you could set up mirrored as RAID 10 for 72 TB of usable storage with very good protection.

Total hardware cost under $12,000.
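
Rough arithmetic on that configuration (a sketch using the prices quoted above; per-unit figures are just the quoted totals divided out):

    # Rough tally of the two-NAS proposal above; prices are the rough street
    # prices quoted, drive capacities are decimal TB.
    nas_cost = 2 * 2000        # two 12-bay Synology units at under $2,000 each
    drive_cost = 24 * 300      # 24 x 6 TB HGST NAS drives at under $300 each
    raw_tb = 24 * 6            # 144 TB raw
    usable_tb = raw_tb // 2    # RAID 10 mirrors everything: 72 TB usable
    print(f"${nas_cost + drive_cost:,} for {usable_tb} TB usable")  # $11,200 for 72 TB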

I'll bet CDW could help you do this. You could make an additional copy of your entire archive even if you have to keep paying through the nose to Central IT.

This ain't rocket surgery.

PS I've been in the document storage software business since 1989. I am not offering to sell you anything like this, that's not what I do.