Friday, May 13, 2016

Finalizing a compression format

With Zstandard v1.0 looming ahead, the last major item for zstd to settle is an extended set of features for its frame encapsulation layer.

Quick overview of the design : data compressed by zstd is cut into blocks. A compressed block has a maximum content size (128 KB), so obviously if input data is larger than this, it will have to occupy multiple blocks.
The frame layer organizes these blocks into a single content. It also provides the decoder with a set of properties that the encoder pledges to respect. These properties allow a decoder to prepare the required resources, such as allocating enough memory.

The current frame layer only stores 1 identifier and 2 parameters  :
  • frame Id : It simply tells which frame and compression formats to expect next. This is currently used to automatically detect legacy formats (v0.5.x, v0.4.x, etc.) and select the right decoder for them. It occupies the first 4 bytes of a frame.
  • windowLog : This is the maximum search distance that will be used by the encoder. It is also the maximum block size, when (1<<windowLog) < MaxBlockSize (== 128 KB). This is enough for a decoder to guarantee a successful decoding operation using a limited buffer budget, whatever the real content size is (endless streaming included).
  • contentSize : This is the amount of data to decode within this frame. This information is optional. It can be used to allocate the exact amount of memory for the object to decode.

This information may seem redundant.
Indeed, in a few situations, it is : when contentSize < (1<<windowLog). In that case, it's enough to allocate contentSize bytes for decoding, and windowLog is just redundant.
But in all other situations, windowLog is useful : either contentSize is unknown (it wasn't known at the beginning of the frame and was only discovered at frame termination), or windowLog defines a smaller memory budget than contentSize, in which case it can be used to limit the memory budget.
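
To illustrate the decision rule above, here is a small sketch (a hypothetical helper, not part of the zstd API; a contentSize of 0 is assumed here to mean "unknown") :

```c
#include <stdint.h>

/* Hypothetical helper : pick the decoder's buffer budget from the two
 * frame parameters. contentSize == 0 means "unknown" in this sketch. */
static uint64_t decoding_buffer_budget(uint64_t contentSize, uint64_t windowSize)
{
    if (contentSize != 0 && contentSize < windowSize)
        return contentSize;   /* whole content fits : windowSize is redundant */
    return windowSize;        /* streaming or large content : window bounds memory */
}
```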

That's all there is for v0.6.x. Arguably, that's a pretty small list.

The intention is to create a more feature complete frame format for v1.0.
Here is a list of features considered, in priority order :
  • Content Checksum : objective is to validate that decoded content is correct.
  • Dictionary ID : objective is to confirm or detect dictionary mismatch, for files which require a dictionary for correct decompression. Without it, a wrong dictionary could be picked, resulting in silent corruption (or an error).
  • Custom content, aka skippable frames : the objective is to allow users to embed custom elements (comments, indexes, etc.) within a file consisting of multiple concatenated frames.
  • Custom window sizes, including non power of 2 : extend current windowLog scheme, to allow more precise choices.
  • Header checksum : validate that the header information has not been accidentally distorted.
Each of these bullet points introduces its own set of questions, detailed below :

Content checksum
The goal of this field is obvious : validate that decoded content is correct. But there are many little details to settle.

Content checksum only protects against accidental errors (transmission, storage, bugs, etc). It's not an electronic "signature".

1) Should it be enabled or disabled by default (field == 0) ?

Suggestion : disabled by default
Reasoning : There are already a lot of checksums around, in storage, in transmission, etc. Consequently, errors are now pretty rare, and when they happen, they tend to be "large" rather than sparse. Also, zstd is likely to detect errors just by parsing the compressed input anyway.

2) Which algorithm ? Should it be selectable ?

Suggestion : xxh64, with an additional header bit reserved in case other checksums are added later, but just a single one defined in v1.
Reasoning : we have transitioned to a 64-bit world. 64-bit checksums are faster to generate than 32-bit ones on such systems. So let's use the faster ones.
xxh64 also has excellent distribution properties, and is highly portable (no dependency on hardware capability). It can also run on 32-bit systems if need be.

3) How many bits for the checksum ?

Current format defines the "frame end mark" as a 3-bytes field, the same size as a block header, which is no accident : it makes parsing easier. This field has a 2-bits header, hence 22 bits free, which can be used for a content checksum. This wouldn't increase the frame size.

22 bits means there is a 1 in 4 million chance of collision in case of error. Said differently, there are 4194303 chances out of 4194304 to detect a decoding error (on top of all the syntax verifications inherent to the format itself). That's more than 99.9999 %. Good enough in my view.
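
As a sketch, folding a 64-bit hash value (such as xxh64's output, assumed already computed) down to the 22 available bits is a simple mask (the function name is illustrative) :

```c
#include <stdint.h>

/* Keep the low 22 bits of a 64-bit content hash, so it fits into the
 * 3-byte frame end mark alongside its 2-bit header. */
static uint32_t contentChecksum22(uint64_t hash64)
{
    return (uint32_t)(hash64 & 0x3FFFFFu);   /* 2^22 - 1 == 4194303 */
}
```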

Dictionary ID

Data compressed using a dictionary needs the exact same dictionary to be regenerated. But no check is performed on the dictionary itself. Selecting a wrong dictionary can result in a data corruption scenario.

The corruption is likely to be detected by parsing the compressed format (or thanks to the previously described optional content checksum field).
But an even better outcome would be to detect such a mismatch immediately, before starting decompression, and with a clearer error message/id than "corruption", which is too generic.

For that, it would be enough to embed a "Dictionary ID" into the frame.
The Dictionary ID would simply be a random value stored inside the dictionary (or an assigned one, provided the user has a way to ensure the same value isn't re-used multiple times). A comparison between the ID in the frame and the ID in the dictionary will be enough to detect a mismatch.

A simple question is : how long should this ID be ? 1, 2, or 4 bytes ?
In my view, 4 bytes is enough for a random-based ID, since it makes the probability of collision very low. But that's still 4 more bytes to fit into the frame header. In some ways it can be considered an efficiency issue.
Maybe some people will prefer 2 bytes ? or maybe even 1 byte (notably for manually assigned ID values) ? or maybe even 0 bytes ?

It's unclear, and I guess multiple scenarios will have different answers.
So maybe a good solution would be to support all 4 possibilities in the format, and default to 4-bytes ID when using dictionary compression.
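
The mismatch check itself is trivial. A sketch (names are illustrative, not the zstd API, and a frame ID of 0 is assumed here to mean "no ID stored") :

```c
#include <stdint.h>

/* Compare the ID found in the frame header with the ID of the loaded
 * dictionary. A frame ID of 0 is assumed to mean "no ID stored". */
static int dictionaryMatches(uint32_t frameDictID, uint32_t loadedDictID)
{
    if (frameDictID == 0) return 1;   /* nothing to verify */
    return frameDictID == loadedDictID;
}
```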

Note that if saving headers is important for your scenario, it's also possible to use the frame-less block format ( ZSTD_compressBlock(), ZSTD_decompressBlock() ), which removes any frame header, saving 12+ bytes in the process. It looks like a small saving, but when the corpus consists of a lot of small messages of ~50 bytes each, it makes quite a difference. The application will then have to save the metadata on its own (which dictionary to use, compressed size, decompressed size, etc.).

Custom content

Embedding custom content can be useful for a lot of unforeseen applications.
For example, it could contain a custom index into compressed content, or a file descriptor, or just some user comment.

The only thing a standard decoder can do is skip this section. Dealing with its content is within the application-specific realm.

The lz4 frame format already defines such a container, as skippable frames. It looks good enough, so let's re-use the same definition.
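
For reference, the lz4 skippable frame layout is a 4-byte little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, followed by a 4-byte little-endian size of the user data, followed by the data itself. A minimal writer sketch (the function name is illustrative) :

```c
#include <stdint.h>
#include <string.h>

/* Write an lz4-style skippable frame : magic (LE), size (LE), user data.
 * Returns the number of bytes written, or 0 if dst is too small. */
static size_t write_skippable_frame(uint8_t* dst, size_t dstCapacity,
                                    const void* userData, uint32_t userSize)
{
    const uint32_t magic = 0x184D2A50;  /* first value of the allowed range */
    if (dstCapacity < 8 + (size_t)userSize) return 0;
    for (int i = 0; i < 4; i++) dst[i]   = (uint8_t)(magic >> (8*i));    /* LE */
    for (int i = 0; i < 4; i++) dst[4+i] = (uint8_t)(userSize >> (8*i)); /* LE */
    memcpy(dst + 8, userData, userSize);
    return 8 + (size_t)userSize;
}
```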

Custom window sizes

The current frame format allows defining window sizes from 4 KB to 128 MB, all intermediate sizes being strict powers of 2 (8 KB, 16 KB, etc.). It works fine, but maybe some users will find its granularity or limits insufficient.
There are 2 parts to consider :

- Allowing larger sizes : the current implementation will have trouble handling window sizes > 256 MB. That being said, it's an implementation issue, not a format issue. An improved version could likely work with larger sizes (at the cost of some complexity).
From a frame format perspective, allowing larger sizes can be as easy as keeping a reserved bit for later.

- Non-power of 2 sizes : Good news is, the internals within zstd are not tied to a specific power of 2, so the problem is limited to sending more precise window sizes. This requires more header bits.
Maybe an unsigned 32-bits value would be good enough for such use.
Note that it doesn't make sense to specify a window size larger than the content size. Such a case should be automatically avoided by the encoder. As for the decoder, it's unclear how it should react : stop and issue an error ? proceed with allocating the larger window size ? or use the smaller content size, and issue an error if the content ends up larger than that ?
Anyway, in many cases, what the user is likely to want is simply enough room for the frame content. In that case, a simple "refer to frame content size" is probably the better solution, with no additional field needed.

Header Checksum

The intention is to catch errors in the frame header before they translate into larger problems for the decoder. Note that only errors can be caught this way : intentional data tampering can simply rebuild the checksum, hence remain undetected.

Suggestion : this is not necessary.

While transmission errors used to be more common a few decades ago, they are much less of a threat today, and when they happen, they tend to garble large sections (not just a few bits).
An erroneous header can nonetheless be detected just by parsing it, considering the number of reserved bits and forbidden values. They must all be validated.
The nail in the coffin is that we no longer trust headers, as they can be abused by remote attackers to deliver an exploit. And that's an area where the header checksum is simply useless. Every field must be validated, and all accepted values must have controllable effects (for example, if the attacker intentionally requests a lot of memory, the decoder shall put a high limit on the accepted amount, and check the allocation result).
So we already are highly protected against errors, by design, because we must be protected against intentional attacks.

Future features : forward and backward compatibility

It's also important to design from day 1 a header format able to safely accommodate future features, with regards to version discrepancy.

The basic idea is to keep a number of reserved bits for these features, set to 0 while waiting for some future definition.

It seems also interesting to split these reserved bits into 2 categories :
- Optional and skippable features : these are features which a decoder can safely ignore, without jeopardizing decompression result. For example, a purely informational signal with no impact on decompression.
- Future features, disabled by default (0) : these features can have an unpredictable impact on the compression format, such as adding a new field costing a few more bytes. A non-compatible decoder cannot take the risk of proceeding with decompression. It will stop upon detecting such a reserved bit set to 1 and give an error message.
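
The decoder-side rule could be sketched like this (the bit masks are hypothetical, not the actual zstd header layout) :

```c
#include <stdint.h>

/* Hypothetical masks for the two reserved-bit categories described above. */
#define SKIPPABLE_FEATURES_MASK  0x0Cu  /* safe to ignore if unknown */
#define MANDATORY_FEATURES_MASK  0x03u  /* must be 0, else refuse to decode */

static int header_flags_acceptable(uint8_t flags)
{
    /* Unknown skippable bits are tolerated; unknown mandatory bits are not. */
    return (flags & MANDATORY_FEATURES_MASK) == 0;
}
```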

While it's great to keep room for the future, it should not take too much of a toll on the present. So only a few bits will be reserved. If more are needed, it simply means another frame format is necessary. In that case, it's enough to use a different frame identifier (first 4 bytes of a frame).


  1. about dictionary size: consider that the format will live for 10-20 years and ensure that it will still work. since a larger dictionary breaks compatibility anyway, it may be just enough to have an "extended header" feature

    your thoughts somewhat resemble my own work on the freearc 2.0 format, so i can give a few tips:

    format starts with bit fields specifying the presence of existing features. it may be further extended by extra bytes with fields for new features that aren't known at the 1.0 timeframe. the last bit of each extension byte means "extend me to the next byte". so, you have N bytes with mandatory fields existing at the 1.0 timeframe, and a standard extension mechanism for newer versions - some, usually last, bit of format X means "extend me with format X+1". if the decoder doesn't support format X+1, it should fail upon seeing this bit enabled

    optional fields should be treated in another way. one possibility is to attach ID+size to each field, f.e. using 7+1 encoding: 0x12 0x81 0x23 means a field with ID 0x12 and size 0x01 * 128 + 0x23. this way you can extend the header with arbitrary fields

    another way is to use the same scheme as above, only with a separate byte sequence, so the decoder just stops reading properties when it doesn't know the X+1 format

    this way you can drop support of large dictionaries in the 1.0, but i suggest to reserve 3-4 bits for "mantissa" part of dictsize. it may greatly improve memory usage in pretty usual scenarios
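
Decoding the 7+1 ID+size scheme from this comment could look like the following sketch (the function name and truncation handling are assumptions; each size byte contributes 7 bits, and a set top bit means "another byte follows") :

```c
#include <stddef.h>
#include <stdint.h>

/* Read one field header : 1 ID byte, then a 7+1-encoded size.
 * Returns the number of bytes consumed, or 0 on truncated input. */
static size_t read_field_header(const uint8_t* src, size_t srcSize,
                                uint8_t* id, uint32_t* size)
{
    size_t pos = 0;
    uint32_t s = 0;
    if (srcSize < 2) return 0;
    *id = src[pos++];
    while (pos < srcSize && (src[pos] & 0x80)) {
        s = (s << 7) | (src[pos++] & 0x7F);   /* continuation byte */
    }
    if (pos >= srcSize) return 0;             /* truncated size field */
    s = (s << 7) | src[pos++];                /* terminal byte */
    *size = s;
    return pos;
}
```

On the example above, {0x12, 0x81, 0x23} yields ID 0x12 and size 0x01 * 128 + 0x23 == 163.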

    1. Thanks for your insightful suggestions Bulat, I'll certainly make use of them

    2. I believe to have one proposition that can match this recommendation.

      Currently, the WindowLog value is a 4-bit field (0-15), ranging from 4 KB to 128 MB by x2 increments.

      The proposition would be to extend that amount to 8 bits, making for a single byte.

      On this total, 5 bits give the power of 2, while 3 bits provide the "fractional" part. Exponent-mantissa, if you wish.

      So it would be possible to specify 10 KB as 8 KB + 2/8th, for example. It's more precise, but not completely precise (doesn't allow every value).

      The 5 bits for the exponent allow very large values. Almost ridiculously high. Keeping a baseline of 4 KB, it means the highest value (31) translates into 1<<(12+31) == ~8 TB. So it might be preferable to use this extra room to lower the minimum value, for example to 1 KB.

      In complement, it would be possible to say "use the frame content size instead", allowing precise selection of any buffer size that can be expressed within 8 bytes (~16 EB). In that case, the "window" byte would simply not be necessary (I rather see this feature as interesting when the content size is extremely small, such as ~100 bytes).

      That also means the decoder must defend itself against ridiculous memory requirements, so any decoder must have an implementation-specific limit and refuse to decode any frame with too high a requirement.
      Currently, such a limit is 32 MB for the 32-bit and 128 MB for the 64-bit version of zstd. But since it is implementation dependent, it could be extended in a future version, without jeopardizing the specification.
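
The proposed encoding could be sketched as follows (the bit layout within the byte is an assumption for illustration; the mantissa adds eighths of the base size) :

```c
#include <stdint.h>

/* Decode the proposed window byte : 5-bit exponent (baseline 4 KB) in the
 * high bits, 3-bit mantissa giving eighths of the base size in the low bits.
 * This layout is an assumption, not a finalized specification. */
static uint64_t window_size_from_byte(uint8_t windowByte)
{
    unsigned exponent = windowByte >> 3;              /* 5 bits */
    unsigned mantissa = windowByte & 7;               /* 3 bits */
    uint64_t base = (uint64_t)1 << (12 + exponent);   /* 4 KB baseline */
    return base + (base >> 3) * mantissa;             /* base + mantissa eighths */
}
```

For example, 10 KB would be encoded as exponent 1 (8 KB base) with mantissa 2 (plus 2/8th, i.e. 2 KB).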

  2. Dictionary ID : can't a checksum computed on the used dictionary be used for the dictionary ID ?
    As far as I understand, this would give it all the necessary properties, and probably get rid of any collision possibilities without any burdensome ID housekeeping.

    1. That's indeed the plan :
      by default, a dictionary ID will be a 4-bytes "random" number, but the "random" will be in fact a hash of dictionary content.

      That looks fine for 4-bytes.
      The problem is for people who consider that 4 bytes is too much for an ID. That's indeed very small, but when the goal is to compress a lot of very small packets, ending in the ~50 bytes range, 4 bytes is almost 10% ...

      2-bytes can still be fine with "random" methodology for a handful of dictionaries, but not much more.

      1 byte is not really compatible with a random ID : the accidental collision probability is too large.

      So, as an option, it will also be possible to manually specify which ID a dictionary should have.
      This will be an advanced option, for people who know what they are doing.

  3. On the content checksum:

    Have you considered using CRC-32C: the Castagnoli polynomial, used by iSCSI et al, implemented in modern x86 hardware instruction set?

    It's faster, and also better at catching random errors. Using the h/w instruction, I believe the speed is about 20GB/s.

    Hashes catch any error with a frequency of (2^n - 1) / 2^n. CRC-32C would also catch:

    1. All burst errors up to 32-bits.
    2. All one-bit errors.
    3. All multiple-bit errors up to the Hamming Distance. The message size for this application is 1Mbit. I haven't seen a Hamming Distance calculation for CRC-32C for messages that large. For 128kbit, the Hamming Distance is 4 [ref 1]. I'd guess for 1Mbit it'd be 2 (catch all 2-bit errors), but that's just a guess.

    If your data parser works on 3-byte words, I suppose you can serialize the CRC32-C in two words.

    Since with modern hardware it's so cheap to calculate, and the common case of using the codec is with large blocks, I think the default should be to enable content checksumming, and the library user only turns it off for the rarer cases of working with small blocks.

    [ref 1]

    1. Good points Daniel.
      An issue with CRC-32C is that its excellent performance is tied to Intel hardware. Outside of it, it's less clear.
      What about other SoCs, such as ARM for example ? or mips, powerpc, and such ?

      Considering the objective of portability, it's mandatory to provide a fast software fallback to cover these systems.
      Alas, last time I checked, there was no fast software fallback competitive with xxHash. That could have changed though.

      CRCs have the advantage of guaranteeing detection of errors below a certain threshold. But only if the full 32-bit crc result is stored. There is no such guarantee on a subset of it (which is a pity here, as the plan is to store 22 bits only).
      Another drawback is that above the error threshold, the collision rate is in fact *higher*. Nothing dramatic nor unexpected.

      So a question is, in how many cases are the errors below the threshold ?
      That's a hard one.
      My feeling is that nowadays, hardware is full of error correction codes already, be it on the storage side or the transmission side. That means data is either clean, or completely scrambled.
      This goes against the need for a detector specialized in "small" errors.

      I see however some software scenarios where it could happen. For example, a write operation which reaches beyond its assigned buffer and pollutes the next one. That could result in a limited error of just a few bytes.
      Though it's just one use case, and likely not the most important one.

  4. I have run into content errors before, but it's typically on very large files downloaded by chrome...hmm...almost hate to waste the space if it isn't required, maybe make the first 8 bits be "zero or one" to tell you if the rest is a checksum or not?