Thursday, May 15, 2014

Stored safe in the Cloud

Steve Kolowich at The Chronicle of Higher Education reports on a major outage and data loss on May 6 at Dedoose:
Dedoose, a cloud-based application for managing research data, suffered a “devastating” technical failure last week that caused academics across the country to lose large amounts of research work, some of which may be gone for good.
...
The crash nonetheless has dealt frustrating setbacks to a number of researchers, highlighting the risks of entrusting data to third-party stewards.
Below the fold, I look at what has been reported and discuss some of these risks.

Eli Lieber, president of SocioCultural Research Consultants, which sells Dedoose, discussed the problem in a blog post and two follow-up posts. Unfortunately, he didn't follow Amazon's excellent practice of detailed technical diagnosis of such failures. The bulk of the posts described recovery efforts and promises of future system improvement; details were limited to:
This devastating system ‘collision’ of Tuesday night resulted from one of our system-critical Microsoft Azure services failing unexpectedly and, thus, pulling all of Dedoose down.  The timing was particularly bad because it occurred in the midst of a full database encryption and backup.  This backup process, in turn, corrupted our entire storage system.
The blog post contains a number of interesting statements. First:
Our immediate work with Microsoft did not result in any substantial assistance.
This is what Dedoose should have expected. The terms of service for cloud storage services make it clear that if the service causes you to lose data, that is your problem, not theirs.  Eli Lieber says:
This is an unprecedented and completely unanticipated event in the history of Dedoose service
However, just as Microsoft's lawyers did, Dedoose's lawyers did anticipate system failures that would lead to loss of data. Dedoose's terms of service state:

6. No Warranties and Indemnification

YOU UNDERSTAND AND AGREE THAT THE SOFTWARE AND WEBSITE ARE PROVIDED “AS IS” AND SCRC, ITS AFFILIATES, SUPPLIERS AND RESELLERS EXPRESSLY DISCLAIM ALL WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. SCRC, ITS AFFILIATES, SUPPLIERS AND RESELLERS MAKE NO WARRANTY OR REPRESENTATION REGARDING THE RESULTS THAT MAY BE OBTAINED FROM THE USE OF THE SOFTWARE, REGARDING THE ACCURACY OR RELIABILITY OF ANY INFORMATION OBTAINED THROUGH THE SOFTWARE, REGARDING ANY GOODS OR SERVICES PURCHASED OR OBTAINED THROUGH YOUR USE OF THE SOFTWARE OR THE WEBSITE, REGARDING ANY TRANSACTIONS ENTERED INTO THROUGH THE SOFTWARE OR THAT THE SOFTWARE WILL MEET ANY USER'S REQUIREMENTS, OR BE UNINTERRUPTED, TIMELY, SECURE OR ERROR FREE. USE OF THE SOFTWARE IS AT YOUR SOLE RISK. ANY MATERIAL AND/OR DATA DOWNLOADED OR OTHERWISE OBTAINED THROUGH THE USE OF THE SOFTWARE OR WEBSITE IS AT YOUR OWN DISCRETION AND RISK. YOU WILL BE SOLELY RESPONSIBLE FOR ANY DAMAGE TO YOU RESULTING FROM THE USE OF THE SOFTWARE. THE ENTIRE RISK ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE REMAINS WITH YOU.
In other words, just as Dedoose's problems with Azure were Dedoose's problem, the researchers' problems with Dedoose are the researchers' problem. Dedoose will undoubtedly do what it can to recover, but it has absolutely no liability whatsoever for the losses its customers suffered.

The lack of a technical diagnosis in the blog post leaves me with a number of questions:
  • Why was "a full database encryption and backup" process, which should only require permission to read the database and write temporary files, running with permissions that allowed it to write the "entire storage system"?
  • Why was "one of our system-critical Microsoft Azure services failing unexpectedly" unexpected? Because the Dedoose system had mysterious capabilities that normally predicted when the services upon which it was based were going to fail so these failures were always expected? Or because the system expected that those services would never fail?
  • Why was the Dedoose system architected with single points of failure so that "one of our system-critical Microsoft Azure services failing" resulted in "pulling all of Dedoose down"?
  • Why does it appear that Dedoose was backing up their data only at monthly intervals?
Dedoose promises that they are:
going tour de force on protection by:
  1. Deploying a database mirror/slave in Azure
  2. Deploying a database mirror/slave into Amazon S3
  3. Keeping a mirror copy of the entire blob storage including all file data, backups, video data synchronized nightly to our private server in an encrypted volume
  4. Storing nightly database backups on the VHD, Azure Blob Storage, and Amazon S3 Storage
  5. Mirroring all Azure file data into an Amazon S3 bucket
  6. Carrying out a weekly restore exercise for the database backups to ensure integrity
  7. Carrying out monthly bare bones from nothing restores of the entire Dedoose Platform to ensure integrity
The first and last two are praiseworthy but would seem to be things that should have been done all along. Running a database without a mirror to fail over to, and trusting that backups will work when you need them without routine testing, seem to show a pretty casual approach.
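
To make "routine testing" of backups concrete, here is a minimal sketch of a restore exercise. It is an illustration only, not Dedoose's system: SQLite from the Python standard library stands in for their database, and the `notes` table and the function names `table_digest`, `backup_and_verify` and `demo` are hypothetical. The idea is simply that a backup is not trusted until it has been reopened and compared against the source.

```python
import hashlib
import os
import sqlite3
import tempfile


def table_digest(conn: sqlite3.Connection) -> str:
    """Hash every row of the hypothetical `notes` table in a stable order."""
    digest = hashlib.sha256()
    for row in conn.execute("SELECT id, body FROM notes ORDER BY id"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def backup_and_verify(live: sqlite3.Connection, backup_path: str) -> bool:
    """Back up `live` to `backup_path`, then reopen the backup and compare it."""
    target = sqlite3.connect(backup_path)
    try:
        # The backup reads from `live` and writes only to the separate target file.
        live.backup(target)
    finally:
        target.close()

    # The restore exercise: open the backup as if recovering from it.
    restored = sqlite3.connect(backup_path)
    try:
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        return intact and table_digest(restored) == table_digest(live)
    finally:
        restored.close()


def demo() -> bool:
    """Build a tiny stand-in database, back it up, and verify the backup."""
    live = sqlite3.connect(":memory:")
    live.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
    live.executemany("INSERT INTO notes (body) VALUES (?)", [("a",), ("b",)])
    live.commit()
    try:
        with tempfile.TemporaryDirectory() as tmp:
            return backup_and_verify(live, os.path.join(tmp, "backup.db"))
    finally:
        live.close()
```

Note that the backup step in this sketch only ever writes to its own target file, which is the property the first question above asks about.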

The others give rise to concern since they amount to making more backups more often. In the abstract this sounds good, but it is important to note that a failure in their backup process "unexpectedly" caused their entire storage system to be corrupted. Unless the underlying cause of the problem in the backup process has been diagnosed and fixed, running that process more frequently will simply create more opportunities for the same failure. There's no evidence in the blog post that the cause has been diagnosed and fixed.
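
The arithmetic behind this concern can be sketched as follows. The per-run failure probability used here is purely illustrative (nothing in the blog posts gives a real figure), and the model assumes each backup run fails independently with the same probability, which is a simplification.

```python
def expected_failures(runs: int, p_fail: float) -> float:
    """Expected number of damaging backup runs, assuming independent runs."""
    return runs * p_fail


def prob_at_least_one(runs: int, p_fail: float) -> float:
    """Probability that at least one of `runs` independent runs goes wrong."""
    return 1.0 - (1.0 - p_fail) ** runs


# Hypothetical per-run probability that a backup damages the storage system.
P_FAIL = 0.001

monthly = prob_at_least_one(12, P_FAIL)    # twelve backup runs a year
nightly = prob_at_least_one(365, P_FAIL)   # one backup run every night
```

Under these assumptions the nightly schedule is roughly thirty times as likely as the monthly one to hit the unfixed fault at some point in a year. More backups reduce the amount of data at risk per incident, but only fixing the fault reduces the number of incidents.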

The point here is not to rag on Dedoose, but to point out that in the world of cloud services layered on other cloud services, each of which expressly disclaims any liability whatsoever for "merchantability or fitness for a particular purpose", the reliability actually delivered to the end user cannot simply be deduced from knowing that the "data is safe in the cloud". It depends critically upon how defensive each layer in the system is with regard to failures in the layers below it, how much redundancy exists at each layer, and how frequently the redundant replicas are synchronized. This is something into which the end user has no visibility, over which he has no control, and which may change through time with no notice. It is all very well saying caveat emptor, but in the cloud the customer has no access to the information needed to make careful choices, and no guarantee that once a choice is made the basis for that choice remains in force.
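
As a rough sketch of why the delivered reliability depends on every layer, consider a toy model in which a stack is up only if every layer is up, and a layer is up if any one of its replicas is up. The availability figures are hypothetical, and the model assumes failures are independent, which the Dedoose incident (a backup corrupting the storage it was meant to protect) shows is optimistic.

```python
from math import prod
from typing import Iterable, Tuple


def layer_availability(replica_availability: float, replicas: int) -> float:
    """A layer is up if at least one of its independent replicas is up."""
    return 1.0 - (1.0 - replica_availability) ** replicas


def stack_availability(layers: Iterable[Tuple[float, int]]) -> float:
    """A stack of layers is up only if every layer in it is up."""
    return prod(layer_availability(a, n) for a, n in layers)


# Three hypothetical layers, each 99% available, with no redundancy anywhere.
no_redundancy = stack_availability([(0.99, 1), (0.99, 1), (0.99, 1)])

# The same three layers with two independent replicas at each layer.
with_redundancy = stack_availability([(0.99, 2), (0.99, 2), (0.99, 2)])
```

Even in this generous model the unreplicated stack is less available than any single layer in it, and the end user has no way to find out which of the two configurations they are actually buying.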

3 comments:

  1. Cory Doctorow at Boing Boing points to another cautionary tale for cloud enthusiasts:

    "As Adobe Creative Suite struggles with its license-server outage, stranding creative professionals around the world without a way of earning their living, a timely reminder: a cloud computer is a computer you're only allowed to use if the phone company and a DRM-peddling giant like Adobe gives you permission, and they can withdraw that permission at any time."

    Details at The Register.

  2. Jack Clark at The Register has an interesting analysis of the forces driving the lemming-like rush to the cloud despite the risks, and its effect on traditional enterprise IT companies.

    "What do all ailing enterprise IT companies have in common? Trouble in their core businesses due to the rise of cloud computing.

    The repercussions that the technology is having on the IT business are all around us, and its effects on the industry are as inevitable as gravity on a dropped bowling ball. Cloud computing's rise spells trouble for any traditional Western IT company you care to name, and has already started to bite into them."

  3. The UK National Archives have published Guidance on Cloud Storage and Digital Preservation by Neal Beagrie, Andrew Charlesworth and Paul Miller. I expect to write more about this document but at first glance the tables describing the legal issues look really useful, while the discussion of costs and risks looks insufficiently critical of the industry hype.
