Fix Deflate Quick when Resuming Due to Filled Output Buffer #14

pvachon · 2016-05-17T13:47:17Z

Under certain circumstances (i.e., when the output buffer is smaller than the compressed output, and the output size is greater than the pending buffer's size), deflate_quick() stops iterating without closing the block it is currently operating on. If there was further input to be processed, deflate_quick() would emit a spurious symbol when it resumed, which would confuse the inflate() state machine. This resulted in corrupted data on output from deflate(), and eventually a failure return code.

This changeset also hoists the check for sufficient pending buffer space out of the quick_send_bits() function, since it was possible to see cases where there was insufficient pending buffer space that resulted in a silent failure.
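As a rough illustration of what hoisting the space check can look like, here is a minimal sketch. The `pending_buf_t` type and the `pending_has_room()`, `send_bits_unchecked()`, and `emit_symbol()` names are hypothetical stand-ins, not the structures or functions touched by this changeset; the point is only that the caller checks for room and can react, rather than the bit writer failing silently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the deflate pending buffer; not zlib's real state. */
typedef struct {
    uint8_t *buf;      /* storage for pending output bytes */
    size_t   size;     /* capacity of buf */
    size_t   pending;  /* bytes currently queued in buf */
    uint32_t bit_buf;  /* bits not yet written out as whole bytes */
    int      bit_cnt;  /* number of valid bits in bit_buf (0..7 between calls) */
} pending_buf_t;

/* Caller-side check: is there room for `bytes` more bytes? */
static int pending_has_room(const pending_buf_t *p, size_t bytes)
{
    return p->size - p->pending >= bytes;
}

/* Appends `len` bits (len <= 16, value < 2^len), least significant bit first.
 * Performs no space check; the caller must have verified there is room. */
static void send_bits_unchecked(pending_buf_t *p, uint32_t value, int len)
{
    p->bit_buf |= value << p->bit_cnt;
    p->bit_cnt += len;
    while (p->bit_cnt >= 8) {
        p->buf[p->pending++] = (uint8_t)(p->bit_buf & 0xff);
        p->bit_buf >>= 8;
        p->bit_cnt -= 8;
    }
}

/* Returns 1 if the symbol was emitted, 0 if the caller must flush first.
 * One call writes at most 2 whole bytes; 4 is a conservative margin. */
static int emit_symbol(pending_buf_t *p, uint32_t code, int len)
{
    if (!pending_has_room(p, 4))
        return 0;               /* visible to the caller, not a silent failure */
    send_bits_unchecked(p, code, len);
    return 1;
}
```

With this shape, a compression loop can stop and request a flush as soon as `emit_symbol()` returns 0, instead of discovering the problem after bits have been dropped.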

This is part of the way to fixing #10 -- there's still some idiom in git that causes deflate_quick() to break under certain circumstances.

Separates the byte-by-byte and short-by-short longest_match implementations into two separately tweakable versions and splits all of the longest match functions into a separate file. Splits the end-chain and early-chain scans and provides likely/unlikely hints to improve branch prediction. Adds an early termination condition for levels 5 and under to stop iterating the hash chain when the match length for the current entry is less than the current best match. Also adjusts variable types and scopes to provide better optimization hints to the compiler.
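The early-termination rule and the branch hints can be sketched as follows. This is one reading of the commit message, not the patch's code: the chain is modeled as a hypothetical flat array of candidate positions rather than zlib's prev[] table, and `LIKELY`/`UNLIKELY`, `match_length()`, and `scan_chain()` are illustrative names.

```c
#include <stddef.h>

/* Branch-prediction hints; fall back to no-ops on compilers without the builtin. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Length of the common prefix of a and b, capped at max_len. */
static size_t match_length(const unsigned char *a, const unsigned char *b,
                           size_t max_len)
{
    size_t n = 0;
    while (n < max_len && a[n] == b[n])
        n++;
    return n;
}

/* Walks candidate positions in chain order and returns the best match length.
 * For level <= 5 the walk stops as soon as a candidate is worse than the
 * current best (the early-termination condition described above). */
static size_t scan_chain(const unsigned char *win, size_t cur, size_t max_len,
                         const size_t *candidates, size_t n_candidates,
                         int level, size_t *best_pos)
{
    size_t best_len = 0;
    for (size_t i = 0; i < n_candidates; i++) {
        size_t len = match_length(win + candidates[i], win + cur, max_len);
        if (len > best_len) {
            best_len = len;
            *best_pos = candidates[i];
            if (UNLIKELY(best_len == max_len))
                break;          /* cannot do better than the maximum */
        } else if (level <= 5 && len < best_len) {
            break;              /* early termination for fast levels */
        }
    }
    return best_len;
}
```

The trade-off is the usual one for fast levels: the scan may miss a longer match further down the chain in exchange for fewer comparisons.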

Uses SSE2 subtraction with saturation to shift the hash in 16B chunks. Renames the old fill_window implementation to fill_window_c(), and adds a new fill_window_sse() implementation in fill_window_sse.c. Moves UPDATE_HASH into deflate.h and changes the scope of read_buf from local to ZLIB_INTERNAL for sharing between the two implementations. Updates the configure script to check for SSE2 intrinsics and enables this optimization by default on x86. The runtime check for SSE2 support only occurs on 32-bit, as x86_64 requires SSE2. Adds an explicit rule in Makefile.in to build fill_window_sse.c with the -msse2 compiler flag, which is required for SSE2 intrinsics.
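To make the saturating-subtraction idea concrete, here is a small sketch. It assumes 16-bit hash table entries that hold window positions, so that sliding the window means subtracting the window size and clamping at zero. The function names are hypothetical and this is not the fill_window_sse() code itself; the SSE2 path is only compiled when the compiler defines `__SSE2__`.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

/* Scalar reference: subtract wsize from each entry, clamping at zero. */
static void slide_hash_scalar(uint16_t *table, size_t n, uint16_t wsize)
{
    for (size_t i = 0; i < n; i++)
        table[i] = (uint16_t)(table[i] >= wsize ? table[i] - wsize : 0);
}

/* Same operation, processing 16 bytes (eight 16-bit entries) per step when
 * SSE2 is available. _mm_subs_epu16 is an unsigned saturating subtract, so
 * no separate compare is needed for the clamp. */
static void slide_hash(uint16_t *table, size_t n, uint16_t wsize)
{
    size_t i = 0;
#ifdef __SSE2__
    const __m128i delta = _mm_set1_epi16((short)wsize);
    for (; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(table + i));
        v = _mm_subs_epu16(v, delta);
        _mm_storeu_si128((__m128i *)(table + i), v);
    }
#endif
    slide_hash_scalar(table + i, n - i, wsize);  /* tail, or everything without SSE2 */
}
```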

For systems supporting SSE4.2, use the crc32 instruction as a fast hash function. Also, provide a better fallback hash. For both new hash functions, we hash 4 bytes, instead of 3, for certain levels. This shortens the hash chains, and also improves the quality of each hash entry.

Rather than copy the input data from strm->next_in into the window and then compute the CRC, this patch combines these two steps into one. It performs an SSE memory copy, while folding the data down in the SSE registers. A final step is added, when we write the gzip trailer, to reduce the 4 SSE registers to 32 bits. Adds some extra padding bytes to the window to allow for SSE partial writes.
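The idea of fusing the copy with the checksum can be shown with a scalar sketch. This deliberately swaps the PCLMULQDQ folding for a plain bitwise CRC-32 (reflected polynomial 0xEDB88320, the one gzip uses), so it illustrates only the single pass over the data, not the SSE technique. `crc32_copy()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies len bytes from src to dst and updates a running CRC-32 in the same
 * pass. Pass 0 as crc for the first chunk, then feed the return value back
 * in for subsequent chunks. */
static uint32_t crc32_copy(uint32_t crc, uint8_t *dst, const uint8_t *src,
                           size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        uint8_t b = src[i];
        dst[i] = b;                      /* the copy ... */
        crc ^= b;                        /* ... and the checksum, one read */
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```

Each input byte is read once and used for both purposes, which is the memory-traffic saving the commit describes; the real patch gets its speed from doing this 16 bytes at a time in SSE registers.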

The deflate_quick strategy is designed to provide maximum deflate performance. deflate_quick achieves this through:

- only checking the first hash match
- using a small inline SSE4.2-optimized longest_match
- forcing a window size of 8K, and using a precomputed dist/len table
- forcing the static Huffman tree and emitting codes immediately instead of tallying

This patch changes the scope of flush_pending, bi_windup, and static_ltree to ZLIB_INTERNAL and moves END_BLOCK, send_code, put_short, and send_bits to deflate.h. Updates the configure script to enable by default for x86. On systems without SSE4.2, fallback is to deflate_fast strategy.

Fixes intel#6
Fixes intel#8

From: Arjan van de Ven <[email protected]>

As the name suggests, the deflate_medium deflate strategy is designed to provide an intermediate strategy between deflate_fast and deflate_slow. After finding two adjacent matches, deflate_medium scans left from the second match in order to determine whether a better match can be formed.

Fixes intel#2

When using deflate_quick() in a streaming fashion and the output buffer runs out of space while the input buffer still has data, deflate_quick() would emit partial symbols. Force the deflate_quick() loop to terminate for a flush before any further processing is done, returning to the main deflate() routine to do its thing.

jtkukunas and others added 18 commits December 13, 2013 09:28

Add architecture detection in configure script.

1af4192

This allows for per-architecture build tuning.

For x86, add CPUID check.

d24da7c

Adds check for SSE2, SSE4.2, and the PCLMULQDQ instructions.
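For reference, a runtime check for these three features might look like the sketch below. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid()`; on other compilers or architectures it reports nothing as available. The struct and function names are hypothetical, and the commit's own check lives in the configure script and may be written differently.

```c
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#  include <cpuid.h>
#  define HAVE_CPUID 1
#else
#  define HAVE_CPUID 0
#endif

/* Hypothetical result type: each field is 0 or 1. */
typedef struct {
    int sse2;
    int sse42;
    int pclmulqdq;
} x86_features_t;

static x86_features_t detect_x86_features(void)
{
    x86_features_t f = {0, 0, 0};
#if HAVE_CPUID
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 reports the basic feature flags in ECX and EDX. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse2      = (int)((edx >> 26) & 1u);  /* EDX bit 26: SSE2 */
        f.sse42     = (int)((ecx >> 20) & 1u);  /* ECX bit 20: SSE4.2 */
        f.pclmulqdq = (int)((ecx >> 1) & 1u);   /* ECX bit 1: PCLMULQDQ */
    }
#endif
    return f;
}
```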

enable 16-bit longest_match for x86

99999a8

Add preprocessor define to tune Adler32 loop unrolling.

fad00ea

Excessive loop unrolling is detrimental to performance. This patch adds a preprocessor define, ADLER32_UNROLL_LESS, to reduce the unrolling factor from 16 to 8. Updates the configure script to set it as the default on x86.
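A compile-time switch of this kind can be sketched as below. Only the ADLER32_UNROLL_LESS name comes from the commit message; the macro layout and `adler32_sketch()` are illustrative assumptions rather than the patch's code. Building with `-DADLER32_UNROLL_LESS` selects the 8-way body, otherwise the 16-way body is used, and both produce the same checksum.

```c
#include <stddef.h>
#include <stdint.h>

#define BASE 65521u   /* largest prime smaller than 65536 */
#define NMAX 5552     /* max bytes before the 32-bit sums must be reduced */

#define DO1(buf, i) { a += (buf)[i]; b += a; }
#define DO2(buf, i) DO1(buf, i) DO1(buf, i + 1)
#define DO4(buf, i) DO2(buf, i) DO2(buf, i + 2)
#define DO8(buf, i) DO4(buf, i) DO4(buf, i + 4)

#ifdef ADLER32_UNROLL_LESS
#  define UNROLL 8
#  define DO_BLOCK(buf) DO8(buf, 0)
#else
#  define UNROLL 16
#  define DO_BLOCK(buf) DO8(buf, 0) DO8(buf, 8)
#endif

/* Adler-32 with a configurable unrolling factor. Start with adler = 1. */
static uint32_t adler32_sketch(uint32_t adler, const uint8_t *buf, size_t len)
{
    uint32_t a = adler & 0xffff;
    uint32_t b = (adler >> 16) & 0xffff;

    while (len > 0) {
        size_t chunk = len < NMAX ? len : NMAX;
        len -= chunk;
        while (chunk >= UNROLL) {       /* unrolled body */
            DO_BLOCK(buf)
            buf += UNROLL;
            chunk -= UNROLL;
        }
        while (chunk--) {               /* leftover bytes */
            a += *buf++;
            b += a;
        }
        a %= BASE;                      /* deferred modulo, once per chunk */
        b %= BASE;
    }
    return (b << 16) | a;
}
```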

Add preprocessor define to tune crc32 unrolling.

fd80ca4

Adds a preprocessor define, CRC32_UNROLL_LESS, to reduce the unrolling factor from 8 to 4 for the crc32 calculation. Updates the configure script to set it as the default on x86.

deflate: avoid use of uninitialized variable

86694e8

(Note emit_match() doesn't currently use the value at all.) Fixes intel#4

Include wmmintrin.h in configure test and crc_folding.c aid clang compilation.

308be56

Add forward declarations for fill_window_sse and flush_pending to deflate_quick.c.

ed145f4

Add crc_ forward declarations to deflate and add read_buf fwd dcl to fill_window_sse.

e176b3c

Add block_open state for deflate_quick

4316869

By storing whether or not a block has been opened (or terminated), the static trees used for the block and the end block markers can be emitted appropriately.
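The role of the block_open flag can be illustrated with a toy model. Everything except the block_open name is hypothetical here: the event log stands in for bits written to the stream, and the real patch operates on the deflate state and static trees instead. The sketch only shows why remembering the flag across calls prevents a spurious block header or end marker when compression resumes after a full output buffer.

```c
#include <stddef.h>

/* Hypothetical events standing in for bits written to the output stream. */
typedef enum { EV_BLOCK_START, EV_SYMBOL, EV_BLOCK_END } event_t;

typedef struct {
    int     block_open;  /* 0: no block in progress; 1: header sent, end marker pending */
    event_t log[64];     /* record of what was emitted, for illustration only */
    size_t  n;
} quick_state_t;

/* Mirrors initializing block_open to 0 when the context is created. */
static void state_init(quick_state_t *s)
{
    s->block_open = 0;
    s->n = 0;
}

static void record(quick_state_t *s, event_t e)
{
    if (s->n < sizeof s->log / sizeof s->log[0])
        s->log[s->n++] = e;
}

/* Emits a symbol, opening a block first only if none is open. Because the
 * flag persists across calls, resuming after a flush does not emit a second
 * header. */
static void emit_symbol(quick_state_t *s)
{
    if (!s->block_open) {
        record(s, EV_BLOCK_START);
        s->block_open = 1;
    }
    record(s, EV_SYMBOL);
}

/* Emits the end-of-block marker exactly once per open block. */
static void close_block(quick_state_t *s)
{
    if (s->block_open) {
        record(s, EV_BLOCK_END);
        s->block_open = 0;
    }
}
```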

Initialize block_open state

d4cd963

On deflation context creation, initialize the block_open state to 0 to ensure that no uninitialized values are used.

pvachon mentioned this pull request May 18, 2016

deflate error when using git #10

Open

Dead2 mentioned this pull request Jan 18, 2017

Invalid gzip stream produced for compression level = 1 zlib-ng/zlib-ng#81

Closed

jtkukunas force-pushed the master branch from 4b9e3f0 to 641f59e Compare June 21, 2018 23:03

jtkukunas force-pushed the master branch from 6169794 to 1ac4c01 Compare April 15, 2022 00:07

jtkukunas force-pushed the master branch from 33f8f35 to bf55d56 Compare April 25, 2022 15:35

jtkukunas force-pushed the master branch from bf55d56 to b22695e Compare August 29, 2022 15:39

busykai force-pushed the master branch from b22695e to 6160a8f Compare November 30, 2022 06:01
