DSHR's Blog: Panel at Library of Congress Storage Architectures meeting

Henry Newman and I ran a panel entitled Cloud Challenges at the Library of Congress' Storage Architectures meeting. Below the fold is the text of my brief presentation, entitled Cloud Services: Caveat Emptor, with links to the sources.

The business model of cloud provision, whether commercial or private, depends on three advantages:

Aggregation of many spiky dynamic demands, thereby eliminating over-provisioning costs.
Economies of scale, for example by reducing staff cost per server.
Lower cost of capital.
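
The aggregation advantage can be illustrated with a small simulation. This is a minimal sketch under an invented demand model (each hypothetical customer idles at 1 unit and bursts to 10 units in a small fraction of hours); none of the numbers come from real cloud usage. It compares the capacity needed if every customer provisions for its own peak with the capacity a provider needs to cover the peak of the combined demand.

```python
import random


def peak_capacity(demands):
    """Capacity needed to serve a demand series without shortfall: its peak."""
    return max(demands)


def simulate(num_customers=50, hours=24 * 7, burst_prob=0.05, seed=0):
    """Return (sum of individual peaks, peak of the aggregate demand).

    Hypothetical model: each customer uses 1 unit per hour and, independently
    of the others, bursts to 10 units with probability burst_prob.
    """
    rng = random.Random(seed)
    customers = [
        [10 if rng.random() < burst_prob else 1 for _ in range(hours)]
        for _ in range(num_customers)
    ]
    # Each customer running its own infrastructure must provision for its own peak.
    sum_of_peaks = sum(peak_capacity(c) for c in customers)
    # A provider serving all of them only needs to cover the peak of the total.
    aggregate = [sum(hour) for hour in zip(*customers)]
    peak_of_sum = peak_capacity(aggregate)
    return sum_of_peaks, peak_of_sum


spiky = simulate()                  # bursty customers: aggregation helps
steady = simulate(burst_prob=0.0)   # flat, base-load customers: no saving
print("spiky: ", spiky)
print("steady:", steady)
```

With bursty customers the peak of the aggregate is never more than the sum of the individual peaks, and is usually far less. With perfectly steady demand the two are equal, so aggregation saves nothing.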

Each of the three phases of digital preservation, ingest, preservation and dissemination, has different characteristics and is thus impacted by the cloud differently.

The process of ingesting a given chunk of content varies considerably between archives. However, it is typically compute-intensive but limited in duration. Archives operating at scale, such as the Internet Archive, Portico and LOCKSS, ingest a constant flow of content chunks. Their demand for resources during ingest tends to be stable, so the aggregation advantage does not apply. They also operate at a large enough scale that the economies of scale advantage has little effect. The ingest pipeline of large archives is thus not a candidate for outsourcing to the cloud.

Smaller archives that ingest intermittently may find that ingest can be advantageously outsourced. However, the cloud services impose significant per-request and per-byte charges on outbound traffic. Unless the content is also to be stored by the cloud that is ingesting it, these charges could add enough to the cost to make outsourcing ingest to the cloud uneconomic even for smaller archives. It is easy for me, coming from hyper-connected Stanford, to ignore the additional cost of the inbound bandwidth from the archive to the cloud. But a study of the costs of cloud backup showed that these bandwidth charges could also be very significant.

The bulk of the cost of preservation is storage. An archive's demand for storage is monotonically increasing; it is the canonical example of base-load demand for which the aggregation advantage does not apply. Further, multiple studies of the economics of cloud storage have shown that while there may well be economies of scale in storage, and commercial cloud providers may well have much lower cost of capital, they are keeping these benefits to themselves. The cost to the customer of these storage services is much higher than doing it themselves. Even Glacier, Amazon's long-term storage product, is only competitive under very favorable assumptions. And note that cloud preservation services, such as Preservica and DuraCloud, have to layer their margins on top of the cloud storage service's already extortionate margins.

Access to preserved content is a more interesting case for the cloud. Let's look at the Library of Congress' Twitter archive as an example. As of January, the Library had 400 requests to access the archive, which was then 130TB and growing at 190GB/day. Unfortunately, the best they could afford to do with the feed was to make two copies on tape. Providing the kind of data-mining capabilities researchers need would involve not merely keeping a copy on hard disk, but also building a substantial compute farm to run the queries. Both of them would be idle much of the time; 400 requests for access is a lot, but it's not nearly enough to keep these resources busy 24/7. As I blogged at the time:

This pattern of demand for compute resources, being spiky, is ideal for cloud computing services. ... Suppose ... that in addition to keeping the two archive copies on tape, the Library kept one copy in S3's Reduced Redundancy Storage simply to enable researchers to access it. Right now it would be costing $7692/mo. Each month this would increase by $319. So a year would cost $115,272. Scholars wanting to access the collection would have to pay for their own computing resources at Amazon, and the per-request charges; because the data transfers would be internal to Amazon there would not be bandwidth charges. The storage charges could be ... charged back to the researchers. ... the 400 outstanding requests would each need to pay about $300 for a year's access to the collection, not an unreasonable charge. If this idea turned out to be a failure it could be terminated with no further cost, the collection would still be safe on tape. In the short term, using cloud storage for an access copy of large, popular collections may be a cost-effective approach.
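
The arithmetic in the quoted passage can be checked with a few lines of Python. This is a sketch that uses only the dollar figures quoted above. The mid-year approximation is my assumption about how the yearly total was derived; it is not stated in the quoted text, though it does reproduce the quoted figure.

```python
MONTHLY_COST_NOW = 7692      # USD per month today, from the quoted passage
MONTHLY_INCREASE = 319       # USD added to the monthly bill each month, from the quoted passage
OUTSTANDING_REQUESTS = 400   # access requests, from the text above


def yearly_cost_midpoint(monthly_now, monthly_increase, months=12):
    """Approximate a linearly growing bill by its mid-period value times the number of months.

    Assumption: this is one plausible way to arrive at the quoted yearly total.
    """
    mid_year_bill = monthly_now + monthly_increase * months / 2
    return mid_year_bill * months


yearly = yearly_cost_midpoint(MONTHLY_COST_NOW, MONTHLY_INCREASE)
per_request = yearly / OUTSTANDING_REQUESTS
print(f"year: ${yearly:,.0f}, per request: ${per_request:,.2f}")
```

Under this assumption the yearly total matches the quoted $115,272, and splitting it across the 400 requests gives a little under the "about $300" each in the quote.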

But, as with ingest, unless the access is coming from within the same cloud it will incur a per-byte charge. Fortunately, for collections that are freely available to the public (as the Twitter archive could not be), Amazon announced back in December 2008 their Free Public Datasets service. This allows owners of moderate-sized collections to host their collection at Amazon without paying an unpredictable and uncontrollable amount each month in access charges. The collection is hosted as EBS images from which users can clone their own EBS copies. The compute, storage and request charges are billed to the user, not to the archive. I doubt this would work for Twitter-size collections; Amazon says:

Typically the data sets in the repository are between 1 GB to 1 TB in size ..., but we can work with you to host larger data sets as well.

Two other issues deserve mention. If commercial cloud services are really much more expensive than doing it yourself, how do they stay in business? First, for most of their customers, whose usage patterns are quite unlike those of preservation, they are not more expensive. The aggregation and economies of scale advantages are very effective for the bulk of their customers, who do have spiky demands. Second, the services use the drug dealer's algorithm, offering free introductory usage and low initial charges. Once a customer has built up a commitment, particularly if they consume a lot of storage, they find that the bandwidth charges in particular are an effective barrier to switching. They won't find a much cheaper competing cloud provider to switch to; the market is dominated by Amazon and the other services are price followers.

Amazon's dominance of the cloud services market is total:

more than five times the compute capacity in use than the aggregate total of the other fourteen providers

It illustrates the strategy that has driven their growth from the start. They have a very long planning horizon, and are willing to run a business on low margins for a long time to drive their competitors out and lock their customers in, both of whom have much shorter planning horizons. Despite the low margins, the stock market understands this and gives them a $130B valuation and thus a low cost of capital to fund the process. With the competitors gone and the customers locked in they can really make money. A recent report in the New York Times shows this happening in the market for books:

Amazon sells about one in four printed books, ... a level of market domination with little precedent in the book trade. ... Now, with Borders dead, Barnes & Noble struggling and independent booksellers greatly diminished, for many consumers there is simply no other way to get many books than through Amazon. And for some books, Amazon is, in effect, beginning to raise prices.

Of course, there is a mismatch between Amazon's overall low margins and the extortionate margins I believe they have on their storage business. I delve into this anomaly on my blog, but it arises from the circumstances of Amazon's early entry into the market, and the lack of significant price competition to force Amazon to drop prices as costs dropped.

To sum up, commercial cloud services are probably not suitable for ingest, definitely not suitable for preservation, possibly suitable for dissemination to a restricted audience, if suitable charging arrangements are in place, and highly suitable for dissemination of medium-sized public data. If private or collaborative cloud services can overcome their higher cost of capital and sustain a business model that charges against costs, rather than against value for the bulk of their customers, they might be viable in each of these areas.

Notice that I have not mentioned any technical aspects of cloud services; these conclusions arise solely from economic and business factors. The reason is that the cloud is not technology; it is built from components you can buy using techniques you can use. What is different about the cloud is the business model, not the technology.

Tuesday, September 24, 2013
