
[POC] zstd decompression #94


Draft: rkistner wants to merge 2 commits into main

Conversation

@rkistner (Contributor) commented Jun 26, 2025

This investigates the possibility of decompressing zstd data in the core extension, which could allow us to use zstd data in the protocol. This POC only looks at zstd decompression itself and tests its performance; it does not actively use it anywhere.

Usage:

cargo build -p powersync_loadable --release
sqlite3 test.db # db with compressed data pre-loaded
.load ./target/release/libpowersync
with dictionary as materialized (select readfile('dictionary') as dict)
select sum(length(zstd_decompress_text(data, dict))) from compressed_data, dictionary;

On my machine, this takes around 500ms to decompress 80MB of data over 100k rows. This will likely be more efficient if we parse the dictionary up-front.
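As a reference point for the "parse the dictionary up-front" idea: a minimal sketch, assuming the `zstd` crate is the binding in use here, where the dictionary is parsed once into a `DecoderDictionary` and then reused for every row instead of being re-parsed per call. `PreparedDict` and the capacity handling are hypothetical, not part of this PR.

```rust
use zstd::bulk::Decompressor;
use zstd::dict::DecoderDictionary;

/// Hypothetical helper: parse the dictionary once, then reuse it for many rows.
struct PreparedDict {
    dict: DecoderDictionary<'static>,
}

impl PreparedDict {
    fn new(raw_dict: &[u8]) -> Self {
        // Copies and fully parses the dictionary up-front.
        Self { dict: DecoderDictionary::copy(raw_dict) }
    }

    /// `capacity` is an upper bound on the decompressed size of `compressed`.
    fn decompress(&self, compressed: &[u8], capacity: usize) -> std::io::Result<Vec<u8>> {
        let mut decompressor = Decompressor::with_prepared_dictionary(&self.dict)?;
        decompressor.decompress(compressed, capacity)
    }
}
```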

This increases the Linux release build size from around 400 kB to 487 kB. Adding compression support would add another 260 kB or so, and I don't think we have a good use case for that on the client.

To actually use this with compressed data, we'd need to additionally:

  1. Use streaming decompression instead of pre-allocating buffers (I think this works by decompressing a block at a time; see the sketch after this list).
  2. Manage dictionaries - we'd need to persist them somewhere, then load, parse and cache them in memory when used.
  3. Implement changes in the protocol and service to send compressed data and dictionaries to the client.
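A rough sketch of what point 1 could look like, again assuming the `zstd` crate: its streaming decoder accepts a dictionary and lets us read the output a fixed-size block at a time instead of pre-allocating the full output buffer. `decompress_in_blocks` and the 16 KiB block size are illustrative only.

```rust
use std::io::Read;
use zstd::stream::read::Decoder;

/// Illustrative sketch: stream-decompress `compressed` in fixed-size blocks,
/// handing each block to `sink` instead of allocating the whole output up front.
fn decompress_in_blocks(
    compressed: &[u8],
    raw_dict: &[u8],
    mut sink: impl FnMut(&[u8]),
) -> std::io::Result<()> {
    let mut decoder = Decoder::with_dictionary(compressed, raw_dict)?;
    let mut block = [0u8; 16 * 1024];
    loop {
        let n = decoder.read(&mut block)?;
        if n == 0 {
            break; // end of the zstd frame
        }
        sink(&block[..n]);
    }
    Ok(())
}
```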

@simolus3 (Contributor) commented Jun 26, 2025

> This will likely be more efficient if we parse the dictionary up-front.

Would we re-use the dictionary across sync lines? Is the dictionary dynamic and sent by the service?

@rkistner (Contributor, Author)

> Would we re-use the dictionary across sync lines? Is the dictionary dynamic and sent by the service?

In short, yes. My initial idea is that compression would be used for the data, rather than for the protocol itself. This allows keeping the data compressed as-is from bucket storage -> sync stream (websocket) -> ps_oplog, and potentially even in the data tables (compressed data tables would be opt-in, since this would slow down queries).

If we compress the data of individual rows on their own, the compression ratio for small rows isn't great (around 0.6 of the original size in my tests). However, if we pre-train a dictionary on the same data, the compression ratio can be as good as 0.2 or 0.1, a 5-10x reduction in size.

This does have some implications:

  1. We'll need to train a separate compression dictionary (or multiple) for each bucket, to avoid leaking data between buckets/users.
  2. We'll need a way to indicate which dictionary is used for which bucket or operations.
  3. We'll need a way to get those dictionaries to the client.
  4. We'll need to manage those dictionaries on the client.

So overall the project becomes quite complex, but I don't think compression is worth it without using dictionaries.
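To make the dictionary side a bit more concrete: training would happen on the service, not in the core extension, but the same zstd library exposes a training API. A hedged sketch using the Rust `zstd` crate, where `bucket_rows` stands in for sample row payloads from a single bucket; the function name, the 16 KiB dictionary size, and compression level 3 are all illustrative.

```rust
use zstd::bulk::Compressor;
use zstd::dict;

/// Illustrative only: train a per-bucket dictionary from sample rows,
/// then compress the rows with it to check the ratio improvement.
fn train_and_compress(bucket_rows: &[Vec<u8>]) -> std::io::Result<()> {
    // Train a small dictionary (16 KiB here) on representative rows of one bucket.
    let dictionary = dict::from_samples(bucket_rows, 16 * 1024)?;

    let mut compressor = Compressor::with_dictionary(3, &dictionary)?;
    for row in bucket_rows {
        let compressed = compressor.compress(row)?;
        println!("{} -> {} bytes", row.len(), compressed.len());
    }
    Ok(())
}
```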

This PR is just one small piece of evaluating the feasibility of the project: Can we efficiently do decompression on the client?
