Inflating zlib, gzip and deflate in Ruby
16 Jun 2020
zlib is a C library that can (de)compress data in three formats:
- raw DEFLATE streams, as documented in RFC1951
- zlib streams, as documented in RFC1950. These are a DEFLATE stream, with a 2-byte header prefix and a 4-byte footer suffix (an Adler-32 checksum)
- gzip streams, as documented in RFC1952. These are a DEFLATE stream, with a header prefix of at least 10 bytes and an 8-byte footer suffix (a CRC-32 and the uncompressed length)
The RFCs were published in the mid 1990s, and the DEFLATE format pre-dates them by a few years. These formats (particularly RFC1952 gzip streams) are super common, thanks to being utilised in various subsequent internet standards (like HTTP).
The Ruby standard library conveniently includes bindings, and decompressing all three formats is quick.
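A minimal sketch of what that can look like. The sample input and variable names are illustrative assumptions; the WINDOW_BITS values follow the conventions in the zlib manual (negative for raw DEFLATE, 8 to 15 for zlib, add 16 for gzip).

```ruby
require "zlib"

plaintext = "hello hello hello"

# Build one sample of each format so the example is self-contained.
raw_stream  = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
                           .deflate(plaintext, Zlib::FINISH)
zlib_stream = Zlib::Deflate.deflate(plaintext)
gzip_stream = Zlib.gzip(plaintext)

# Zlib::MAX_WBITS is 15. The sign and offset select the format.
inflated_raw  = Zlib::Inflate.new(-Zlib::MAX_WBITS).inflate(raw_stream)      # raw DEFLATE
inflated_zlib = Zlib::Inflate.new(Zlib::MAX_WBITS).inflate(zlib_stream)      # zlib
inflated_gzip = Zlib::Inflate.new(Zlib::MAX_WBITS + 16).inflate(gzip_stream) # gzip
```

Each call returns the original bytes as a binary string.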
The internet is full of suggestions for the correct WINDOW_BITS magic number in various situations, but how do you pick the right one (other than just trying them all until one works)?
Why the magic numbers?
WINDOW_BITS is intended to indicate the size of the window zlib should use,
where the window is equal to 2^WINDOW_BITS bytes. The window can be anywhere
from 256 to 32,768 bytes (32KB), so the normal range of WINDOW_BITS is 8 to 15.
In practice, with the memory we have in 2020 computers, 15 is the most common
value.
It seems the zlib authors have overloaded the argument, allowing us to request extra behaviour by using numbers outside the range that makes sense for windows.
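To make the overloading concrete, here is a rough sketch of how a WINDOW_BITS value can be read when inflating. `describe_window_bits` is a hypothetical helper written for illustration, not part of Ruby's Zlib module; the ranges are taken from the zlib manual's description of `inflateInit2`.

```ruby
# Hypothetical helper: explains what a WINDOW_BITS value asks zlib's
# inflate to do. The window is always 2^n bytes, where n is between 8 and 15.
def describe_window_bits(bits)
  case bits
  when 8..15   then { format: :zlib,         window_bytes: 2**bits }
  when -15..-8 then { format: :raw_deflate,  window_bytes: 2**bits.abs }
  when 24..31  then { format: :gzip,         window_bytes: 2**(bits - 16) }
  when 40..47  then { format: :zlib_or_gzip, window_bytes: 2**(bits - 32) }
  else raise ArgumentError, "unsupported WINDOW_BITS: #{bits}"
  end
end

describe_window_bits(15)  # zlib stream, 32,768-byte window
describe_window_bits(-15) # raw DEFLATE stream, 32,768-byte window
describe_window_bits(31)  # gzip stream, 32,768-byte window
describe_window_bits(47)  # auto-detect zlib or gzip, 32,768-byte window
```

The sketch leaves out 0, which inflate also accepts and which means "use the window size recorded in the zlib header".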
… and what is a window anyway?
When deflating, a larger window instructs zlib to search a larger area of the
plaintext (up to 32KB backwards) for repeated patterns. A larger window will
result in better compression, at the cost of more memory consumption. In
practice, with the memory we have in 2020 computers, 32KB is tiny and
WINDOW_BITS of 15 is the default value.
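The effect of the window on compression can be sketched with input whose only repetition sits further back than a small window can see. The `deflated_size` helper and the seeded random data are assumptions made for this illustration.

```ruby
require "zlib"

# 4KB of incompressible bytes, repeated once. The only redundancy is a
# match 4,096 bytes back: too far for a 512-byte window, well within 32KB.
block = Random.new(42).bytes(4096)
data  = block * 2

# Hypothetical helper: size of the zlib stream produced for a given window.
def deflated_size(data, window_bits)
  deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, window_bits)
  deflater.deflate(data, Zlib::FINISH).bytesize
ensure
  deflater&.close
end

small_window_size = deflated_size(data, 9)  # 512-byte window
large_window_size = deflated_size(data, 15) # 32KB window

puts "WINDOW_BITS 9:  #{small_window_size} bytes"
puts "WINDOW_BITS 15: #{large_window_size} bytes"
```

With the 32KB window the second copy of the block can be encoded as back-references to the first, so the output should be roughly half the size of the small-window result.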
When inflating, manually setting the window size isn’t very useful. Inflating
requires a window at least as large as the one used to deflate a stream, and
the default behaviour of always allocating 32KB is rarely a problem. Setting
WINDOW_BITS to a value outside the 8-15 range to specify formats is likely
the only time you’ll need to use it.
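The window requirement when inflating can be sketched as follows. `try_inflate` is a hypothetical helper for this example, and the behaviour shown assumes the stock zlib library, which rejects a zlib stream whose header declares a larger window than the one requested.

```ruby
require "zlib"

payload = "hello " * 1000

# Deflated with the default 32KB window (WINDOW_BITS 15).
default_stream = Zlib::Deflate.deflate(payload)

# Deflated with a 512-byte window (WINDOW_BITS 9).
small_deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, 9)
small_stream   = small_deflater.deflate(payload, Zlib::FINISH)
small_deflater.close

# Hypothetical helper: returns the inflated data, or nil if zlib refuses.
def try_inflate(stream, window_bits)
  inflater = Zlib::Inflate.new(window_bits)
  inflater.inflate(stream)
rescue Zlib::Error
  nil
ensure
  inflater&.close
end

try_inflate(default_stream, 8)  # window too small for this stream
try_inflate(default_stream, 15) # window large enough
try_inflate(small_stream, 9)    # window matches the one used to deflate
try_inflate(small_stream, 15)   # a larger window is always acceptable
```

Only the first call is expected to fail, which is why leaving the inflate window at 15 is the safe choice.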