Tuesday, March 19, 2019

Compression vs. Preservation

An archive in the middle of a hardware refresh cycle has asked me to comment on concerns arising because its favored storage hardware uses data compression, which may not be possible to disable even if disabling it were a good idea. This is an issue I wrote about two years ago in Threats to stored data.

Because similar concerns keep re-appearing in discussions of digital preservation, I decided this time to handle it as I did Cloud for Preservation, writing a post that discusses the general issues without referring to a specific institution. Below the fold, the details.

First, it is important to distinguish the two kinds of data compression:
  • Lossless compression, in which the output from decompression is identical to the input to compression.
  • Lossy compression, in which the output from decompression is similar to the input to compression, but not identical.
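The distinction is easy to demonstrate (a minimal sketch in Python using the standard zlib module; the test data is made up):

    # A lossless codec round-trips exactly: decompress(compress(x)) == x.
    import zlib

    original = b"Some preserved content, repeated " * 1000   # stand-in for a real file
    compressed = zlib.compress(original, 9)
    assert zlib.decompress(compressed) == original   # lossless: bit-identical
    print(len(original), "->", len(compressed), "bytes")

    # A lossy codec (JPEG, MP3, ...) returns only an approximation of its
    # input, so the equivalent assertion would fail.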
It is worth pointing out that traditional ingest practice in paper archives applies what is in effect lossy compression, reducing the vast amount of paper coming in to an amount that is affordable to preserve. In Archival Sampling: A Method of Appraisal and A Means of Retention, James Gregory Bradsher and Bruce I. Ambacher write about "quantitative sampling", examining a random sample of the content of incoming boxes of paper:
Quantitative sampling permits statistical analysis, but may not provide for the retention of files with exceptional research potential - those pertaining to important or interesting persons and subjects. Some archivists, therefore, have attempted qualitative selection to provide for the retention of the "exceptional" case files - those that generally would be destroyed if only some form of quantitative sample was retained.
In other words, archival practice is to select both files considered important, and a random sample of other files. The result is a compressed version of the input that, on decompression, yields content similar to the input.
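In software terms the appraisal amounts to something like the following sketch (the is_exceptional() predicate and the 5% sampling rate are purely illustrative assumptions, not archival recommendations):

    import random

    def appraise(case_files, is_exceptional, sample_rate=0.05, seed=2019):
        """Keep the 'exceptional' files plus a random sample of the rest."""
        rng = random.Random(seed)
        exceptional = [f for f in case_files if is_exceptional(f)]
        ordinary = [f for f in case_files if not is_exceptional(f)]
        sample = rng.sample(ordinary, int(len(ordinary) * sample_rate))
        return exceptional + sample   # a lossy "compression" of the accession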

Digital archives have the same problem in spades. The UK National Archives' Best practice guide to appraising and selecting records for The National Archives states:
For over 50 years public records bodies have followed the system of appraisal established by the Grigg Report in 1954. This involves appraisal of records at five years (for ongoing business value and potential archival value) and 25 years (for archival value), followed by the transfer of selected records at 30 years. Using this system records of archival value are identified using file-by-file review. Fifty years on this system is, in most circumstances, no longer viable for a number of reasons:
  • Digital information is vulnerable, and may become inaccessible due to format obsolescence or deletion. Departments need to capture decisions about the value of information early in the information lifecycle. This is vital to ensure essential context is maintained and to allow for efficient and effective appraisal of the information
  • The increasing volume, proliferation and complexity of digital information means that a file-by-file approach is prohibitive and resource-intensive
  • Paper file-by-file review is time consuming, and may not be an efficient use of departmental resources
  • From 2013 the 30 year rule will be gradually reduced to 20 years over a 10-year transition timetable. More records will need to be reviewed in a short space of time to comply with the change in legislation
  • Machinery of government changes (where functions are abolished or transferred) mean that large amounts of information often have to be appraised at short notice
Departments should appraise their records and make selections at the highest level possible, for example, at a business function or series level. Appraisal at a file level (file-by-file review) or individual document level should be reserved for only those cases where appropriate. In addition digital records should be appraised as close to creation as possible.
Thus the input to the preservation function of a digital archive will be the output of a lossy compression process. This isn't just true for national and institutional archives; it is also true for Web archives, which are a sample of, and thus a compressed version of, the Web, and for e-journal archives, which are a sample of the universe of scholarly journals. It has been estimated that both contain less than half of their respective universes.

The preservation function must ensure that no further unplanned loss occurs, so it can use only lossless compression. The demise of the Digital Preservation Network illustrates the fact that one of the major threats to preserved digital content is running out of money. In theory, using lossless compression can reduce the cost per byte of storage, thereby mitigating this threat to some extent. The questions that arise are:
  • To what extent will compressing the data in this way reduce the cost of storage, and how big an effect would this have on overall preservation cost?
  • Does compressing the data in this way increase vulnerability to non-economic threats?

Effectiveness of Compression

By how much can lossless compression reduce the size of preserved data? That depends on the type of data being compressed. To illustrate the effect, I ran some simple experiments on my desktop computer. The table reports, for each extension, the number of files, their median size in KB, the average ratio (as a percentage) of the size of the gzip (lossless) compressed file to the original, and the same ratio for the largest file of that type. ".pdf" files are printer output from programs, ".PDF" files are output from my scanner.

File Type   File Count   Median KB   Avg. compressed %   Largest-file compressed %
.html            26342           8               30.50                        7.96
.java            92331          49               38.42                        2.02
.PDF              2019         637               82.98                       76.72
.pdf              4527         223               85.64                       89.75
.jpg             21296         614               96.96                       99.34
.mp3               977       3,362               98.56                       90.38
.m4v                14      92,551               99.72                       99.67
.png              8507          14              100.31                      100.00
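The measurement can be reproduced with a script along these lines (a sketch using Python's standard gzip and statistics modules, not the exact script I ran; the directory root is a placeholder):

    import gzip, os, statistics
    from collections import defaultdict

    sizes = defaultdict(list)            # extension -> [(original, gzipped), ...]

    for dirpath, _, filenames in os.walk("/home/me"):        # placeholder root
        for name in filenames:
            ext = os.path.splitext(name)[1]
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    data = f.read()
            except OSError:
                continue
            sizes[ext].append((len(data), len(gzip.compress(data))))

    for ext, pairs in sorted(sizes.items()):
        pairs = [(o, c) for o, c in pairs if o > 0]
        if not pairs:
            continue
        ratios = [100.0 * c / o for o, c in pairs]
        big_o, big_c = max(pairs)        # largest file, by original size
        print(ext, len(pairs),
              round(statistics.median(o for o, _ in pairs) / 1024),  # median KB
              round(statistics.mean(ratios), 2),                     # avg. %
              round(100.0 * big_c / big_o, 2))                       # largest-file %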

What do we learn from this?
  • Media files are large, so they are stored in formats that are already compressed, lossily for .jpg, .mp3 and .m4v, losslessly for .png. Trying to compress them further has little effect, and can even result in files that are slightly larger, as the .png row shows.
  • Text files, such as those collected in Web archives, can be significantly reduced in size by lossless compression. That is why these files are typically stored in gzip-compressed WARC files.
Of course, the files on my desktop don't reflect all the file types an archive would store. For example, there are no high-resolution sound files or images such as those in the British Library's Gutenberg bibles. Using lossy compression on the preserved versions of these files is clearly inappropriate, however appropriate it may be for their dissemination. Lossless compression can afford significant space saving in this case.

Let's grossly over-simplify and do a back-of-the-envelope calculation. The LOCKSS Program's long-time rule of thumb is that the total lifecycle cost of preserving content is about one-half ingest, one-third preservation, and one-sixth dissemination.

Currently, 10TB drives retail for around $300 each. They have a 5-year service life, so cost around $6/TB/year. S3 One Zone-IA is $0.01/GB/month, or $120/TB/year. It is clear that raw media cost is a small part of the overall cost of storing data for the long term.

Let us be generous and say that media cost is 25% of storage cost, or about 8% of total lifecycle preservation cost. Then if compression halved the size of the data it would reduce total lifecycle cost by around 4%. You can argue with the details, but unless special circumstances apply even heroic data compression will have a limited impact on total lifecycle preservation cost.
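For those who want the arithmetic spelled out (every number below is the rough guess from the preceding paragraphs, nothing more):

    preservation_share = 1 / 3          # preservation is ~1/3 of lifecycle cost
    media_share_of_storage = 0.25       # generous: media is 25% of storage cost
    compression_saving = 0.5            # assume compression halves the bytes

    media_share_of_total = preservation_share * media_share_of_storage   # ~8%
    total_saving = media_share_of_total * compression_saving             # ~4%
    print(f"media is ~{media_share_of_total:.0%} of lifecycle cost; "
          f"halving the data saves ~{total_saving:.0%}")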

Risks of Compression

Lossless compression works by eliminating redundancy to shrink the data; preservation works by adding redundancy to safeguard the data. At a fundamental level, therefore, lossless compression and preservation are in conflict. But it isn't that simple. The architecture of preservation systems is layered, so redundancy can be added or removed at each level. There are system design tradeoffs to be made as to where, and by how much, redundancy should be increased or decreased.

Let's look at the ingest journey data takes through these layers:
  • Selection process: as described above, this is a lossy process, eliminating not just redundancy but actual data.
  • Preservation process: this layer can both reduce redundancy, for example by using lossless compression on media files, and increase it, by replication or erasure coding (the sketch after this list compares the overheads of the two).
  • File system: this layer typically neither reduces nor increases redundancy, although file systems that compress content exist, as do file systems that deduplicate (see Caveat below). Neither is appropriate for preservation system use.
  • Media controller: media controllers can significantly increase redundancy, as for example in RAID systems.
  • Storage media: because the data physically stored on the media is in analog form, the data stream recovered by the low-level read process is noisy. The write channel therefore adds redundancy so that forward error correction can be applied to the noisy stream on read.
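To make the preservation-layer tradeoff concrete, here is a sketch of the storage overhead of the two common ways of adding redundancy (the 3-copy and 10+4 parameters are purely illustrative):

    def replication(copies):
        """Return (overhead per byte of content, number of losses survivable)."""
        return copies - 1, copies - 1

    def erasure_code(data_shards, parity_shards):
        """Reed-Solomon style k+m: any m lost shards can be reconstructed."""
        return parity_shards / data_shards, parity_shards

    print(replication(3))        # (2, 2): 200% overhead, survives 2 losses
    print(erasure_code(10, 4))   # (0.4, 4): 40% overhead, survives 4 losses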
There are two reasons why, no matter what storage system underlies it, the preservation layer needs to add redundancy via replication or erasure coding:
  • As I described in Keeping Bits Safe: How Hard Can It Be?, the reliability needed to keep, for example, a Petabyte for a century with no bit errors is many orders of magnitude beyond the capability of current (or foreseeable) storage media (a back-of-the-envelope version of the calculation follows this list).
  • As the EU DAVID project documented, bit flips in storage media are not a significant cause of data loss in archives compared to, for example, human error. Their report concluded:
    It was acknowledged that some rare cases or corruptions might have been explained by the occurrence of bit rot, but the importance and the risk of this phenomenon was at the present time much lower than any other possible causes of content losses.
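The first point can be made with a few lines of arithmetic (a back-of-the-envelope version of the argument in Keeping Bits Safe; it assumes independent, exponentially distributed bit flips, which is itself optimistic):

    PETABYTE_BITS = 8e15
    CENTURY_YEARS = 100
    AGE_OF_UNIVERSE_YEARS = 1.4e10

    # For a 50% chance that not a single bit flips in a century, each bit
    # needs a half-life of (number of bits) * (storage period):
    bit_half_life_years = PETABYTE_BITS * CENTURY_YEARS
    print(bit_half_life_years / AGE_OF_UNIVERSE_YEARS)
    # ~6e7: the required bit half-life is tens of millions of times the
    # age of the universe, far beyond any plausible storage medium.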
By reducing redundancy, lossless compression increases the impact of any bit flips that do occur, as detailed in Heydegger's 2008 and 2009 papers (both links are paywalled, but the 2008 paper is available via the Wayback Machine). But the preservation layer has to be capable of surviving total loss of an underlying file, or even an entire underlying disk, so it must also survive the case in which a single bit flip renders the entire rest of the file unintelligible.
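Heydegger's observation is easy to reproduce (a sketch: flip one bit in the middle of a gzip stream and try to get the original back; the precise failure mode depends on where the flipped bit lands):

    import gzip, zlib

    original = b"An archive's worth of text, repeated " * 1000
    damaged = bytearray(gzip.compress(original))
    damaged[len(damaged) // 2] ^= 0x01        # flip a single bit mid-stream

    try:
        recovered = gzip.decompress(bytes(damaged))
        differing = sum(a != b for a, b in zip(original, recovered))
        print(f"decompressed, but {differing} bytes differ")
    except (OSError, EOFError, zlib.error) as e:
        print("decompression failed:", e)      # the more common outcome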

Conclusion

Readers should take away three main points:
  1. The use of lossy compression by the preservation function of an archival system, as opposed to by the creators or selectors of preserved content, is clearly inappropriate.
  2. It is appropriate to use lossless compression for raw media files such as sound files or high-resolution scanned images. It can result in significant cost savings, and is very unlikely to impair overall preservation system reliability.
  3. The use of lossless compression by architects of the lower layers of the preservation function should not cause concern. It is extremely unlikely to impair overall preservation system reliability, though it is equally unlikely to provide significant cost savings because almost all the data being stored will already have been compressed at the top of the preservation stack.

Caveat

This discussion does not apply to deduplication, in which the system detects that duplicate data is being stored in multiple locations, replacing all but one of the duplicates with a pointer to the remaining one. This can convert a system that intends to store Lots Of Copies To Keep Stuff Safe into one that stores a single copy at much lower reliability. Such systems must ensure that each of the Lots Of Copies is physically separate from all the others.
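For clarity, the mechanism in question looks roughly like this (a content-addressed sketch; real deduplicating systems typically work at block rather than whole-file granularity):

    import hashlib

    store = {}                                   # digest -> the single stored copy

    def dedup_put(data: bytes) -> str:
        """Store data once; later identical 'copies' become mere pointers."""
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        return digest                            # the pointer kept in place of a copy

    copy_a = dedup_put(b"the same preserved file")
    copy_b = dedup_put(b"the same preserved file")
    assert copy_a == copy_b and len(store) == 1  # two "copies", one actual copy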

1 comment:

James Doig said...

Thanks David, for another excellent post. It's certainly an interesting point that the selection process is essentially a form of lossy compression.

There are one or two things I don't necessarily agree with, eg "File system: this layer typically neither reduces nor increases redundancy, although file systems that compress content exist, as do file systems that deduplicate (see Caveat below). Neither is appropriate for preservation system use."

I think you're saying that this cancels out redundancy. This is true; however, in practice it is normally used to cancel out unintended redundancy, with other, specific functions then providing intentional redundancy (e.g. cancel out multiple copies of a file saved in numerous network locations, but then replicate the remaining copies to multiple sites for redundancy). The advantage of this is that you are using known, specific techniques for redundancy and saving space where it is not needed, rather than depending on incidental or ad-hoc methods for recovery in the event of failures (which are only ever likely to allow for recovery of some data, rather than all).

On file system compression - it would be good to hear definitively why it is not appropriate for preservation system use. Are there really compelling reasons to exclude such systems? Of course, a risk assessment would need to be carried out based on the specific system and software in use, but that might determine that the risk of turning off the compression is greater, e.g. in a storage system that uses compression by default (turning off the compression would, on the face of it, be a non-standard implementation).