austinderek
Fixes golang#75532

This improves the compression speed of the flate package.
This is a cleaned-up version of github.com/klauspost/compress/flate.

Overall changes:

  • Compression levels 2-6 are custom implementations.
  • Compression levels 7-9 are tweaked to match levels 2-6, with minor improvements.
  • Tokens are encoded and indexed when added.
  • Huffman encoding attempts to continue blocks instead of always starting a new one.
  • Loads/stores are in separate functions and can be made to use unsafe.

Overall, this attempts to better balance the compression levels,
which tended to have little spread at the top levels.

The intention is to place "default" at the point where performance drops off
considerably without a proportional improvement in compression ratio.
In my package I have set 5 as the default, but this change keeps it at level 6.

"Unsafe" operations have been removed for now.
They can trivially be added back.
Removing them costs approximately 10% in speed.
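To illustrate why isolating loads makes the unsafe variant easy to reintroduce (the helper names below are hypothetical, not the CL's actual code): with all loads behind one tiny function, the bounds-checked and unchecked versions are interchangeable at a single call site.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unsafe"
)

// load64 is a safe little-endian 64-bit load; the compiler can often
// eliminate the bounds check, but not always.
func load64(b []byte, i int) uint64 {
	return binary.LittleEndian.Uint64(b[i : i+8])
}

// load64Unsafe skips bounds checks entirely. The caller must guarantee
// i+8 <= len(b), and it reads native byte order, so it only matches
// load64 on little-endian machines.
func load64Unsafe(b []byte, i int) uint64 {
	return *(*uint64)(unsafe.Pointer(&b[i]))
}

func main() {
	b := []byte{1, 2, 3, 4, 5, 6, 7, 8, 9}
	fmt.Printf("%#x %#x\n", load64(b, 0), load64(b, 1))
}
```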

Below are results from the standard library's built-in benchmarks.
I do not think these are a particularly good representation of different
data types, so I have also run benchmarks on various data types.

I have compiled the benchmarks on https://stdeflate.klauspost.com/

The main focus has been on level 1 (fastest),
level 5+6 (default) and level 9 (smallest).
Levels outside this range are rarely used, but they should still
fill their role reasonably.

Level 9 will attempt more aggressive compression,
but will also typically be slightly slower than before.

I hope the graphs above show that focusing on a few data types
doesn't always give the full picture.

My own observations:

Levels 1 and 2 often "trade places" depending on the data type.
Since level 1 usually compresses the least of the two -
while being slightly faster, with lower memory usage -
it is placed as the lowest.

The switchover between level 6 and 7 is not always smooth,
since the search method changes significantly.

Random data is now ~100x faster on levels 2-6, and ~3x faster on levels 7-9.
You can feed pre-compressed data with no significant speed penalty.

benchmark                                     old ns/op     new ns/op     delta
BenchmarkEncode/Digits/Huffman/1e4-32         11431         8001          -30.01%
BenchmarkEncode/Digits/Huffman/1e5-32         123175        74780         -39.29%
BenchmarkEncode/Digits/Huffman/1e6-32         1260402       750022        -40.49%
BenchmarkEncode/Digits/Speed/1e4-32           35100         23758         -32.31%
BenchmarkEncode/Digits/Speed/1e5-32           675355        385954        -42.85%
BenchmarkEncode/Digits/Speed/1e6-32           6878375       4873784       -29.14%
BenchmarkEncode/Digits/Default/1e4-32         63411         40974         -35.38%
BenchmarkEncode/Digits/Default/1e5-32         1815762       801563        -55.86%
BenchmarkEncode/Digits/Default/1e6-32         18875894      8101836       -57.08%
BenchmarkEncode/Digits/Compression/1e4-32     63859         85275         +33.54%
BenchmarkEncode/Digits/Compression/1e5-32     1803745       2752174       +52.58%
BenchmarkEncode/Digits/Compression/1e6-32     18931995      30727403      +62.30%
BenchmarkEncode/Newton/Huffman/1e4-32         15770         11108         -29.56%
BenchmarkEncode/Newton/Huffman/1e5-32         134567        85103         -36.76%
BenchmarkEncode/Newton/Huffman/1e6-32         1663889       1030186       -38.09%
BenchmarkEncode/Newton/Speed/1e4-32           32749         22934         -29.97%
BenchmarkEncode/Newton/Speed/1e5-32           565609        336750        -40.46%
BenchmarkEncode/Newton/Speed/1e6-32           5996011       3815437       -36.37%
BenchmarkEncode/Newton/Default/1e4-32         70505         34148         -51.57%
BenchmarkEncode/Newton/Default/1e5-32         2374066       570673        -75.96%
BenchmarkEncode/Newton/Default/1e6-32         24562355      5975917       -75.67%
BenchmarkEncode/Newton/Compression/1e4-32     71505         77670         +8.62%
BenchmarkEncode/Newton/Compression/1e5-32     3345768       3730804       +11.51%
BenchmarkEncode/Newton/Compression/1e6-32     35770364      39768939      +11.18%

benchmark                                     old MB/s     new MB/s     speedup
BenchmarkEncode/Digits/Huffman/1e4-32         874.80       1249.91      1.43x
BenchmarkEncode/Digits/Huffman/1e5-32         811.86       1337.25      1.65x
BenchmarkEncode/Digits/Huffman/1e6-32         793.40       1333.29      1.68x
BenchmarkEncode/Digits/Speed/1e4-32           284.90       420.91       1.48x
BenchmarkEncode/Digits/Speed/1e5-32           148.07       259.10       1.75x
BenchmarkEncode/Digits/Speed/1e6-32           145.38       205.18       1.41x
BenchmarkEncode/Digits/Default/1e4-32         157.70       244.06       1.55x
BenchmarkEncode/Digits/Default/1e5-32         55.07        124.76       2.27x
BenchmarkEncode/Digits/Default/1e6-32         52.98        123.43       2.33x
BenchmarkEncode/Digits/Compression/1e4-32     156.59       117.27       0.75x
BenchmarkEncode/Digits/Compression/1e5-32     55.44        36.33        0.66x
BenchmarkEncode/Digits/Compression/1e6-32     52.82        32.54        0.62x
BenchmarkEncode/Newton/Huffman/1e4-32         634.13       900.25       1.42x
BenchmarkEncode/Newton/Huffman/1e5-32         743.12       1175.04      1.58x
BenchmarkEncode/Newton/Huffman/1e6-32         601.00       970.70       1.62x
BenchmarkEncode/Newton/Speed/1e4-32           305.35       436.03       1.43x
BenchmarkEncode/Newton/Speed/1e5-32           176.80       296.96       1.68x
BenchmarkEncode/Newton/Speed/1e6-32           166.78       262.09       1.57x
BenchmarkEncode/Newton/Default/1e4-32         141.83       292.84       2.06x
BenchmarkEncode/Newton/Default/1e5-32         42.12        175.23       4.16x
BenchmarkEncode/Newton/Default/1e6-32         40.71        167.34       4.11x
BenchmarkEncode/Newton/Compression/1e4-32     139.85       128.75       0.92x
BenchmarkEncode/Newton/Compression/1e5-32     29.89        26.80        0.90x
BenchmarkEncode/Newton/Compression/1e6-32     27.96        25.15        0.90x

Static Memory Usage:

Before:
Level -2: Memory Used: 704KB, 8 allocs
Level -1: Memory Used: 776KB, 7 allocs
Level 0: Memory Used: 704KB, 7 allocs
Level 1: Memory Used: 1160KB, 13 allocs
Level 2: Memory Used: 776KB, 8 allocs
Level 3: Memory Used: 776KB, 8 allocs
Level 4: Memory Used: 776KB, 8 allocs
Level 5: Memory Used: 776KB, 8 allocs
Level 6: Memory Used: 776KB, 8 allocs
Level 7: Memory Used: 776KB, 8 allocs
Level 8: Memory Used: 776KB, 9 allocs
Level 9: Memory Used: 776KB, 8 allocs

After:
Level -2: Memory Used: 272KB, 12 allocs
Level -1: Memory Used: 1016KB, 7 allocs
Level 0: Memory Used: 304KB, 6 allocs
Level 1: Memory Used: 760KB, 13 allocs
Level 2: Memory Used: 1144KB, 8 allocs
Level 3: Memory Used: 1144KB, 8 allocs
Level 4: Memory Used: 888KB, 14 allocs
Level 5: Memory Used: 1016KB, 8 allocs
Level 6: Memory Used: 1016KB, 8 allocs
Level 7: Memory Used: 952KB, 7 allocs
Level 8: Memory Used: 952KB, 7 allocs
Level 9: Memory Used: 1080KB, 9 allocs

This package has been fuzz tested for about 24 hours.
Currently, there is about 1h between new "interesting" finds.
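The core property being fuzzed is that anything compressed at any level inflates back to the original bytes; my own sketch of that round-trip check (not the actual harness or corpus used):

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io"
	"log"
	"math/rand"
)

// roundTrip compresses data at level and verifies it inflates back exactly;
// a fuzzer hammers on this property with generated inputs.
func roundTrip(data []byte, level int) error {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		return err
	}
	if _, err := w.Write(data); err != nil {
		return err
	}
	if err := w.Close(); err != nil {
		return err
	}
	got, err := io.ReadAll(flate.NewReader(&buf))
	if err != nil {
		return err
	}
	if !bytes.Equal(got, data) {
		return fmt.Errorf("level %d: round trip mismatch", level)
	}
	return nil
}

func main() {
	rng := rand.New(rand.NewSource(1))
	// Exercise every supported level, from HuffmanOnly (-2) to BestCompression (9).
	for level := flate.HuffmanOnly; level <= flate.BestCompression; level++ {
		data := make([]byte, rng.Intn(1<<16))
		rng.Read(data)
		if err := roundTrip(data, level); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("all levels round-trip ok")
}
```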

Change-Id: Icb4c9839dc8f1bb96fd6d548038679a7554a559b


🔄 This is a mirror of upstream PR golang#75624
