
Fix Deflate Quick when Resuming Due to Filled Output Buffer #14

Open
wants to merge 18 commits into master

Conversation


@pvachon commented May 17, 2016

Under certain circumstances (i.e. where the output buffer is smaller than the size of the compressed output, and the output size is greater than the pending buffer's size), deflate_quick() terminates iterating without closing the block it is currently operating on. If there is further input to be processed, deflate_quick() then emits a spurious symbol when it resumes, which confuses the inflate() state machine. This results in corrupted data on output from deflate(), and eventually a failure return code.

This changeset also hoists the check for sufficient pending buffer space out of the quick_send_bits() function, since it was possible to hit cases where insufficient pending buffer space resulted in a silent failure.
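
For illustration, here is a minimal sketch of what the hoisted check might look like, assuming zlib's deflate_state fields (s->pending, s->pending_buf_size) and the Buf_size bit-buffer constant from trees.c; the helper name is hypothetical, not the patch verbatim:

```c
/* Hypothetical helper: true if the pending buffer can absorb the
 * worst-case spill of one send_bits() call (Buf_size bits). */
static int quick_pending_room(deflate_state *s) {
    return s->pending + ((Buf_size + 7) >> 3) <= s->pending_buf_size;
}
```

The caller would test this before emitting each symbol and, on failure, return to deflate() to flush, rather than letting quick_send_bits() truncate output silently.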

This is part of the way to fixing #10 -- there's still some idiom in git that causes deflate_quick() to break under certain circumstances.

jtkukunas and others added 18 commits December 13, 2013 09:28
This allows for per-architecture build tuning.
Adds check for SSE2, SSE4.2, and the PCLMULQDQ instructions.
Excessive loop unrolling is detrimental to performance. This patch
adds a preprocessor define, ADLER32_UNROLL_LESS, to reduce unrolling
factor from 16 to 8.

Updates configure script to set as default on x86
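
As a hedged sketch, the shape of the change in adler32.c could look like this, reusing stock zlib's NMAX constant and DO8/DO16 macros (the #ifdef arms are illustrative):

```c
#ifdef ADLER32_UNROLL_LESS
    n = NMAX / 8;
    do {
        DO8(buf, 0);    /* process 8 bytes per iteration */
        buf += 8;
    } while (--n);
#else
    n = NMAX / 16;
    do {
        DO16(buf);      /* process 16 bytes per iteration */
        buf += 16;
    } while (--n);
#endif
```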
Separates the byte-by-byte and short-by-short longest_match
implementations into two separately tweakable versions and
splits all of the longest match functions into a separate file.

Split the end-chain and early-chain scans and provide likely/unlikely
hints to improve branch prediction.

Add an early termination condition for levels 5 and under to stop
iterating the hash chain when the match length for the current
entry is less than the current best match.

Also adjust variable types and scopes to provide better optimization
hints to the compiler.
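
A self-contained toy illustrating the early-termination rule and the branch hints (all names here are hypothetical, not the patch's):

```c
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Walk a hash chain (newest entry first) and return the best match
 * length for scan[0..maxlen), giving up early at low levels. */
static unsigned best_in_chain(const unsigned char *win,
                              const unsigned *chain, unsigned nchain,
                              const unsigned char *scan, unsigned maxlen,
                              int level) {
    unsigned best_len = 0;
    for (unsigned i = 0; i < nchain; i++) {
        const unsigned char *match = win + chain[i];
        if (best_len >= maxlen)
            break;                      /* can't improve further */
        /* End-chain scan first: the byte at best_len must match for
         * this entry to possibly beat the current best. */
        if (likely(match[best_len] != scan[best_len]))
            continue;
        unsigned len = 0;
        while (len < maxlen && match[len] == scan[len])
            len++;
        if (len > best_len)
            best_len = len;
        /* Levels 5 and under: a match shorter than the current best
         * predicts that older chain entries won't do better either. */
        else if (level <= 5 && len < best_len)
            break;
    }
    return best_len;
}
```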
Adds a preprocessor define, CRC32_UNROLL_LESS, to reduce unrolling
factor from 8 to 4 for the crc32 calculation.

Updates configure script to set as default on x86
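
In stock zlib's crc32(), the main loop reads `while (len >= 8) { DO8; len -= 8; }` with argument-less DO4/DO8 macros; under CRC32_UNROLL_LESS it would instead step by 4, along these lines (illustrative):

```c
#ifdef CRC32_UNROLL_LESS
    while (len >= 4) {
        DO4;            /* 4 table lookups per iteration */
        len -= 4;
    }
#else
    while (len >= 8) {
        DO8;            /* 8 table lookups per iteration */
        len -= 8;
    }
#endif
```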
Uses SSE2 subtraction with saturation to shift the hash in
16B chunks. Renames the old fill_window implementation to
fill_window_c(), and adds a new fill_window_sse() implementation
in fill_window_sse.c.

Moves UPDATE_HASH into deflate.h and changes the scope of
read_buf from local to ZLIB_INTERNAL for sharing between
the two implementations.

Updates the configure script to check for SSE2 intrinsics and enables
this optimization by default on x86. The runtime check for SSE2 support
only occurs on 32-bit, as x86_64 requires SSE2. Adds an explicit
rule in Makefile.in to build fill_window_sse.c with the -msse2 compiler
flag, which is required for SSE2 intrinsics.
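
A minimal sketch of the saturating-subtract slide, assuming SSE2 intrinsics and 16-bit hash entries (zlib's Posf); the function name is illustrative:

```c
#include <emmintrin.h>

/* Slide a hash table down by wsize, 8 entries (16 bytes) at a time.
 * _mm_subs_epu16 saturates at zero, so entries that would go negative
 * become NIL (0) instead of wrapping around. */
static void slide_hash_sse2(unsigned short *table, unsigned entries,
                            unsigned wsize) {
    const __m128i ws = _mm_set1_epi16((short)wsize);
    for (unsigned i = 0; i < entries; i += 8) {
        __m128i v = _mm_loadu_si128((__m128i *)(table + i));
        _mm_storeu_si128((__m128i *)(table + i), _mm_subs_epu16(v, ws));
    }
}
```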
For systems supporting SSE4.2, use the crc32 instruction as a fast
hash function. Also, provide a better fallback hash.

For both new hash functions, we hash 4 bytes, instead of 3, for certain
levels. This shortens the hash chains, and also improves the quality
of each hash entry.
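
A hedged sketch of the SSE4.2 path (not the patch verbatim): hash 4 bytes with the hardware crc32 instruction. Compile with -msse4.2; hash_mask stands in for zlib's s->hash_mask:

```c
#include <nmmintrin.h>  /* SSE4.2: _mm_crc32_u32 */
#include <string.h>

static unsigned insert_hash_crc(const unsigned char *str,
                                unsigned hash_mask) {
    unsigned val;
    memcpy(&val, str, sizeof(val));             /* read 4 bytes, not 3 */
    return _mm_crc32_u32(0, val) & hash_mask;   /* crc32 as hash */
}
```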
Rather than copy the input data from strm->next_in into the window and
then compute the CRC, this patch combines these two steps into one. It
performs a SSE memory copy, while folding the data down in the SSE
registers. A final step is added, when we write the gzip trailer,
to reduce the 4 SSE registers to 32b.

Adds some extra padding bytes to the window to allow for SSE partial
writes.
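
Conceptually, the gzip path of read_buf() becomes one combined step. The folding-helper names below follow the commit description but should be treated as illustrative:

```c
    /* inside read_buf(strm, buf, size), after len is computed: */
    if (strm->state->wrap == 2) {
        /* gzip: SSE copy that folds the running CRC in XMM registers */
        crc_fold_copy(strm->state, buf, strm->next_in, len);
    } else {
        zmemcpy(buf, strm->next_in, len);
        if (strm->state->wrap == 1)
            strm->adler = adler32(strm->adler, buf, len);
    }
    /* when the gzip trailer is written:
     *   strm->adler = crc_fold_512to32(strm->state);
     * reduces the 4 XMM registers to the final 32-bit CRC. */
```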
The deflate_quick strategy is designed to provide maximum
deflate performance.

deflate_quick achieves this through:
    - only checking the first hash match
    - using a small inline SSE4.2-optimized longest_match (sketched
      after this list)
    - forcing a window size of 8K, and using a precomputed dist/len
      table
    - forcing the static Huffman tree and emitting codes immediately
      instead of tallying
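
For a flavor of the inline SSE4.2 compare (a sketch, not the patch's code), _mm_cmpestri can report the first differing byte of two 16-byte chunks:

```c
#include <nmmintrin.h>

/* Length of the common prefix of a[0..15] and b[0..15]; returns 16 if
 * all bytes match. Caller must guarantee 16 readable bytes at each. */
static unsigned match_len16(const unsigned char *a, const unsigned char *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return (unsigned)_mm_cmpestri(va, 16, vb, 16,
            _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH | _SIDD_NEGATIVE_POLARITY);
}
```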

This patch changes the scope of flush_pending, bi_windup, and
static_ltree to ZLIB_INTERNAL and moves END_BLOCK, send_code,
put_short, and send_bits to deflate.h.

Updates the configure script to enable by default for x86. On systems
without SSE4.2, fallback is to deflate_fast strategy.

Fixes intel#6
Fixes intel#8
From: Arjan van de Ven <[email protected]>

As the name suggests, the deflate_medium deflate strategy is designed
to provide an intermediate strategy between deflate_fast and deflate_slow.
After finding two adjacent matches, deflate_medium scans left from
the second match in order to determine whether a better match can be
formed.

Fixes intel#2
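
A hedged, self-contained sketch of the left-scan idea (the match_t layout and names are hypothetical, not the patch's):

```c
typedef struct {
    unsigned start;   /* window offset where the match begins */
    unsigned len;     /* match length */
    unsigned dist;    /* distance back to the match source */
} match_t;

/* Grow the second of two adjacent matches leftward while the byte
 * before its start also matches the byte before its source, trimming
 * the first match where the two would overlap. */
static void scan_left(const unsigned char *win, match_t *m1, match_t *m2) {
    while (m2->start > m1->start &&                      /* room to steal */
           m2->start > m2->dist &&                       /* source in range */
           win[m2->start - 1] == win[m2->start - m2->dist - 1]) {
        m2->start--;
        m2->len++;
        if (m1->start + m1->len > m2->start)
            m1->len = m2->start - m1->start;             /* trim overlap */
    }
}
```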
(Note emit_match() doesn't currently use the value at all.)

Fixes intel#4
When using deflate_quick() in a streaming fashion and the output buffer
runs out of space while the input buffer still has data, deflate_quick()
would emit partial symbols. Force the deflate_quick() loop to terminate
for a flush before any further processing is done, returning to the main
deflate() routine to do its thing.
By storing whether or not a block has been opened (or terminated), the
static trees used for the block and the end block markers can be emitted
appropriately.
On deflation context creation, initialize the block_open state to 0 to
ensure that no uninitialized values are used.
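
A sketch of how the block_open flag might gate header and end-of-block emission inside deflate_quick(), assuming zlib's send_bits()/send_code() and the STATIC_TREES constant; illustrative, not the patch verbatim:

```c
    if (!s->block_open) {
        /* Opening a fresh static-tree block: emit the 3-bit header. */
        send_bits(s, (STATIC_TREES << 1) + last, 3);
        s->block_open = 1;
    }

    /* ... emit literal/match codes; if the pending buffer fills up,
     * return to deflate() to flush WITHOUT closing the block, so no
     * spurious header is emitted on re-entry ... */

    if (last || flush_block) {
        send_code(s, END_BLOCK, static_ltree);  /* close the block */
        s->block_open = 0;
    }
```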