Tuesday, April 10, 2018

Natural Redundancy

Most uncompressed files contain significant redundancy, which is why a compression algorithm can make them smaller; compression works by removing redundancy. The better the algorithm, the less redundancy is left in the output. If the files are then stored for the long term, they need to be protected, for example by erasure coding, which adds some redundancy back. In Exploiting Source Redundancy to Improve the Rate of Polar Codes, Ying Wang, Krishna R. Narayanan and Anxiao (Andrew) Jiang of Texas A&M explore using the original redundancy to reduce the amount of protection redundancy needed for a given level of reliability. Below the fold, some commentary.

In a press release Jiang is quoted as saying:
In real data the redundancy can take on more complex forms, such as a text that talks about "raining" may also talk about 'umbrella' or other related things, or the bits in data may satisfy some mathematical equation, but the principle is the same - once we know the bits in data are dependent on each other in some way, we can use that knowledge to correct errors,
This work is theoretically interesting but of limited use in practical long-term preservation:
  • The threat environment means that future cloud storage systems must be designed to encrypt data at rest (see, for example, Krste Asanović's keynote at FAST14). Effective encryption results in files with no redundancy to exploit. Indeed, most practical encryption systems compress their plaintexts before encryption to remove redundancy.
  • The kind of redundancy Wang et al. are enhancing is intended to protect against "bit rot". But this is only one of the many threats to stored data. Cloud providers use erasure coding and other forms of redundancy to provide, for example, geographic dispersion to protect against catastrophic loss of a data center.
  • The redundancy needed for protection is frequently less than the natural redundancy in the uncompressed file. The major threat to stored data is economic, so compressing files before erasure coding them for storage will typically reduce cost and thus enhance data survivability.
For all these reasons, there is likely to be little if any redundancy available in practical digital preservation systems for Wang et al.'s technique to use.
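
The points above can be illustrated with a rough sketch in Python. It is not taken from the paper. It uses zlib's compressed size as a crude proxy for how much redundancy a file contains, uses os.urandom output as a stand-in for well-encrypted data (an assumption; no real cipher is involved), and assumes a hypothetical 10+4 erasure code whose storage overhead is (k + m) / k. The sample text and all names are invented for illustration.

```python
import os
import zlib


def compressed_fraction(data: bytes) -> float:
    """Compressed size divided by original size; lower means more redundancy."""
    return len(zlib.compress(data, 9)) / len(data)


def stored_bytes(size: int, data_shards: int, parity_shards: int) -> float:
    """Bytes stored after erasure coding: size * (k + m) / k."""
    return size * (data_shards + parity_shards) / data_shards


# Highly redundant sample text (illustrative only).
text = b"It is raining, so take an umbrella. " * 2000

# Stand-in for ciphertext: good encryption output should look like random bytes.
random_like = os.urandom(len(text))

text_fraction = compressed_fraction(text)
random_fraction = compressed_fraction(random_like)

# Hypothetical erasure code parameters: 10 data shards, 4 parity shards.
K, M = 10, 4
raw_cost = stored_bytes(len(text), K, M)
compressed_cost = stored_bytes(len(zlib.compress(text, 9)), K, M)

print(f"text compresses to {text_fraction:.1%} of its size")
print(f"random-like data compresses to {random_fraction:.1%} of its size")
print(f"stored bytes, erasure coded raw:        {raw_cost:.0f}")
print(f"stored bytes, compressed then encoded:  {compressed_cost:.0f}")
```

Under these assumptions the random-like data should barely compress at all, which is the sense in which encrypted files leave nothing for the technique to exploit, while compressing the text before erasure coding should cut the stored bytes well below the uncompressed figure.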
