austinderek
Fixes golang#75532

This improves the compression speed of the flate package.
This is a cleaned-up version of github.com/klauspost/compress/flate.

Overall changes:

  • Compression levels 2-6 are custom implementations.
  • Compression levels 7-9 are tweaked to match levels 2-6, with minor improvements.
  • Tokens are encoded and indexed when added.
  • Huffman encoding attempts to continue blocks instead of always starting a new one.
  • Loads/stores are in separate functions and can be made to use unsafe.

Overall, this attempts to better balance the compression levels,
which tended to have little spread at the top levels.

The intention is to place "default" at the point where performance drops off
considerably without a proportional improvement in compression ratio.
In my package I have set 5 as the default, but this change keeps it at level 6.

"Unsafe" operations have been removed for now.
They can trivially be added back.
Removing them costs approximately 10% in speed.
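To illustrate why isolating loads makes the unsafe variant easy to reintroduce (the helper names below are hypothetical, not the CL's actual code): with all loads behind one tiny function, the bounds-checked and unchecked versions are interchangeable at a single call site.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// load64 is a safe little-endian 64-bit load; the compiler can often
// eliminate the bounds check, but not always.
func load64(b []byte, i int) uint64 {
	return binary.LittleEndian.Uint64(b[i : i+8])
}

// load64Unsafe skips bounds checks entirely. The caller must guarantee
// i+8 <= len(b), and it reads native byte order, so it only matches
// load64 on little-endian machines.
func load64Unsafe(b []byte, i int) uint64 {
	return *(*uint64)(unsafe.Pointer(&b[i]))
}

func main() {
	b := []byte{1, 2, 3, 4, 5, 6, 7, 8, 9}
	fmt.Printf("%#x %#x\n", load64(b, 0), load64(b, 1))
}
```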

Below are results from the standard library's built-in benchmarks.
I do not think these are a particularly good representation of different
data types, so I have also run benchmarks on various data types.

I have compiled the benchmarks on https://stdeflate.klauspost.com/

The main focus has been on level 1 (fastest),
level 5+6 (default) and level 9 (smallest).
Levels outside this range are rarely used, but they should still
fill their role reasonably.

Level 9 will attempt more aggressive compression,
but will also typically be slightly slower than before.

I hope the graphs above show that focusing on a few data types
doesn't always give the full picture.

My own observations:

Levels 1 and 2 often "trade places" depending on the data type.
Since level 1 usually compresses the least of the two -
while being slightly faster, with lower memory usage -
it is placed as the lowest.

The switchover between level 6 and 7 is not always smooth,
since the search method changes significantly.

Random data is now ~100x faster on levels 2-6, and ~3x faster on levels 7-9.
You can feed pre-compressed data with no significant speed penalty.

benchmark                                     old ns/op     new ns/op     delta
BenchmarkEncode/Digits/Huffman/1e4-32         11431         8001          -30.01%
BenchmarkEncode/Digits/Huffman/1e5-32         123175        74780         -39.29%
BenchmarkEncode/Digits/Huffman/1e6-32         1260402       750022        -40.49%
BenchmarkEncode/Digits/Speed/1e4-32           35100         23758         -32.31%
BenchmarkEncode/Digits/Speed/1e5-32           675355        385954        -42.85%
BenchmarkEncode/Digits/Speed/1e6-32           6878375       4873784       -29.14%
BenchmarkEncode/Digits/Default/1e4-32         63411         40974         -35.38%
BenchmarkEncode/Digits/Default/1e5-32         1815762       801563        -55.86%
BenchmarkEncode/Digits/Default/1e6-32         18875894      8101836       -57.08%
BenchmarkEncode/Digits/Compression/1e4-32     63859         85275         +33.54%
BenchmarkEncode/Digits/Compression/1e5-32     1803745       2752174       +52.58%
BenchmarkEncode/Digits/Compression/1e6-32     18931995      30727403      +62.30%
BenchmarkEncode/Newton/Huffman/1e4-32         15770         11108         -29.56%
BenchmarkEncode/Newton/Huffman/1e5-32         134567        85103         -36.76%
BenchmarkEncode/Newton/Huffman/1e6-32         1663889       1030186       -38.09%
BenchmarkEncode/Newton/Speed/1e4-32           32749         22934         -29.97%
BenchmarkEncode/Newton/Speed/1e5-32           565609        336750        -40.46%
BenchmarkEncode/Newton/Speed/1e6-32           5996011       3815437       -36.37%
BenchmarkEncode/Newton/Default/1e4-32         70505         34148         -51.57%
BenchmarkEncode/Newton/Default/1e5-32         2374066       570673        -75.96%
BenchmarkEncode/Newton/Default/1e6-32         24562355      5975917       -75.67%
BenchmarkEncode/Newton/Compression/1e4-32     71505         77670         +8.62%
BenchmarkEncode/Newton/Compression/1e5-32     3345768       3730804       +11.51%
BenchmarkEncode/Newton/Compression/1e6-32     35770364      39768939      +11.18%

benchmark                                     old MB/s     new MB/s     speedup
BenchmarkEncode/Digits/Huffman/1e4-32         874.80       1249.91      1.43x
BenchmarkEncode/Digits/Huffman/1e5-32         811.86       1337.25      1.65x
BenchmarkEncode/Digits/Huffman/1e6-32         793.40       1333.29      1.68x
BenchmarkEncode/Digits/Speed/1e4-32           284.90       420.91       1.48x
BenchmarkEncode/Digits/Speed/1e5-32           148.07       259.10       1.75x
BenchmarkEncode/Digits/Speed/1e6-32           145.38       205.18       1.41x
BenchmarkEncode/Digits/Default/1e4-32         157.70       244.06       1.55x
BenchmarkEncode/Digits/Default/1e5-32         55.07        124.76       2.27x
BenchmarkEncode/Digits/Default/1e6-32         52.98        123.43       2.33x
BenchmarkEncode/Digits/Compression/1e4-32     156.59       117.27       0.75x
BenchmarkEncode/Digits/Compression/1e5-32     55.44        36.33        0.66x
BenchmarkEncode/Digits/Compression/1e6-32     52.82        32.54        0.62x
BenchmarkEncode/Newton/Huffman/1e4-32         634.13       900.25       1.42x
BenchmarkEncode/Newton/Huffman/1e5-32         743.12       1175.04      1.58x
BenchmarkEncode/Newton/Huffman/1e6-32         601.00       970.70       1.62x
BenchmarkEncode/Newton/Speed/1e4-32           305.35       436.03       1.43x
BenchmarkEncode/Newton/Speed/1e5-32           176.80       296.96       1.68x
BenchmarkEncode/Newton/Speed/1e6-32           166.78       262.09       1.57x
BenchmarkEncode/Newton/Default/1e4-32         141.83       292.84       2.06x
BenchmarkEncode/Newton/Default/1e5-32         42.12        175.23       4.16x
BenchmarkEncode/Newton/Default/1e6-32         40.71        167.34       4.11x
BenchmarkEncode/Newton/Compression/1e4-32     139.85       128.75       0.92x
BenchmarkEncode/Newton/Compression/1e5-32     29.89        26.80        0.90x
BenchmarkEncode/Newton/Compression/1e6-32     27.96        25.15        0.90x

Static Memory Usage:

Before:
Level -2: Memory Used: 704KB, 8 allocs
Level -1: Memory Used: 776KB, 7 allocs
Level 0: Memory Used: 704KB, 7 allocs
Level 1: Memory Used: 1160KB, 13 allocs
Level 2: Memory Used: 776KB, 8 allocs
Level 3: Memory Used: 776KB, 8 allocs
Level 4: Memory Used: 776KB, 8 allocs
Level 5: Memory Used: 776KB, 8 allocs
Level 6: Memory Used: 776KB, 8 allocs
Level 7: Memory Used: 776KB, 8 allocs
Level 8: Memory Used: 776KB, 9 allocs
Level 9: Memory Used: 776KB, 8 allocs

After:
Level -2: Memory Used: 272KB, 12 allocs
Level -1: Memory Used: 1016KB, 7 allocs
Level 0: Memory Used: 304KB, 6 allocs
Level 1: Memory Used: 760KB, 13 allocs
Level 2: Memory Used: 1144KB, 8 allocs
Level 3: Memory Used: 1144KB, 8 allocs
Level 4: Memory Used: 888KB, 14 allocs
Level 5: Memory Used: 1016KB, 8 allocs
Level 6: Memory Used: 1016KB, 8 allocs
Level 7: Memory Used: 952KB, 7 allocs
Level 8: Memory Used: 952KB, 7 allocs
Level 9: Memory Used: 1080KB, 9 allocs

This package has been fuzz tested for about 24 hours.
Currently, there is about 1h between new "interesting" finds.
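The core property being fuzzed is that anything compressed at any level inflates back to the original bytes; my own sketch of that round-trip check (not the actual harness or corpus used):

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"log"
	"math/rand"
)

// roundTrip compresses data at level and verifies it inflates back exactly;
// a fuzzer hammers on this property with generated inputs.
func roundTrip(data []byte, level int) error {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	got, err := io.ReadAll(flate.NewReader(&buf))
	if err != nil {
		return err
	}
	if !bytes.Equal(got, data) {
		return fmt.Errorf("level %d: round trip mismatch", level)
	}
	return nil
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Exercise every supported level, from HuffmanOnly (-2) to BestCompression (9).
	for level := flate.HuffmanOnly; level <= flate.BestCompression; level++ {
		data := make([]byte, rng.Intn(1<<16))
		rng.Read(data)
		if err := roundTrip(data, level); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("all levels round-trip ok")
}
```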

Change-Id: Icb4c9839dc8f1bb96fd6d548038679a7554a559b


🔄 This is a mirror of upstream PR golang#75624
