
[POC] zstd decompression #94


Draft: rkistner wants to merge 2 commits into main

Conversation

@rkistner (Contributor) commented Jun 26, 2025

This investigates the possibility of decompressing zstd data in the core extension, which could allow us to use zstd data in the protocol. This POC only looks at zstd decompression itself and tests its performance; it does not actively use it anywhere.

Usage:

cargo build -p powersync_loadable --release
sqlite3 test.db # db with compressed data pre-loaded
.load ./target/release/libpowersync
with dictionary as materialized (select readfile('dictionary') as dict)
select sum(length(zstd_decompress_text(data, dict))) from compressed_data, dictionary;

On my machine, this takes around 500ms to decompress 80MB of data over 100k rows. This will likely be more efficient if we parse the dictionary up-front.
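As a reference point for the "parse the dictionary up-front" idea: a minimal sketch, assuming the `zstd` crate is the binding in use here, where the dictionary is parsed once into a `DecoderDictionary` and then reused for every row instead of being re-parsed per call. `PreparedDict` and the capacity handling are hypothetical, not part of this PR.

```rust
use zstd::bulk::Decompressor;
use zstd::dict::DecoderDictionary;

/// Hypothetical helper: parse the dictionary once, then reuse it for many rows.
struct PreparedDict {
    dict: DecoderDictionary<'static>,
}

impl PreparedDict {
    fn new(raw_dict: &[u8]) -> Self {
        // Copies and fully parses the dictionary up-front.
        Self { dict: DecoderDictionary::copy(raw_dict) }
    }

    /// `capacity` is an upper bound on the decompressed size of `compressed`.
    fn decompress(&self, compressed: &[u8], capacity: usize) -> std::io::Result<Vec<u8>> {
        let mut decompressor = Decompressor::with_prepared_dictionary(&self.dict)?;
        decompressor.decompress(compressed, capacity)
    }
}
```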

This increases the Linux release build size from around 400 kB to 487 kB. Adding compression support would add another 260 kB or so, and I don't think we have a good use case for that on the client.

To actually use this with compressed data, we'd need to additionally:

  1. Use streaming decompression instead of pre-allocating buffers (I think this works by decompressing a block at a time; see the sketch after this list).
  2. Manage dictionaries - we'd need to persist them somewhere, then load, parse and cache them in memory when used.
  3. Implement changes in the protocol and service to send compressed data and dictionaries to the client.
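A rough sketch of what point 1 could look like, again assuming the `zstd` crate: its streaming decoder accepts a dictionary and lets us read the output a fixed-size block at a time instead of pre-allocating the full output buffer. `decompress_in_blocks` and the 16 KiB block size are illustrative only.

```rust
use std::io::Read;
use zstd::stream::read::Decoder;

/// Illustrative sketch: stream-decompress `compressed` in fixed-size blocks,
/// handing each block to `sink` instead of allocating the whole output up front.
fn decompress_in_blocks(
    compressed: &[u8],
    raw_dict: &[u8],
    mut sink: impl FnMut(&[u8]),
) -> std::io::Result<()> {
    let mut decoder = Decoder::with_dictionary(compressed, raw_dict)?;
    let mut block = [0u8; 16 * 1024];
    loop {
        let n = decoder.read(&mut block)?;
        if n == 0 {
            break; // end of the zstd frame
        }
        sink(&block[..n]);
    }
    Ok(())
}
```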

@simolus3 (Contributor) commented Jun 26, 2025

> This will likely be more efficient if we parse the dictionary up-front.

Would we re-use the dictionary across sync lines? Is the dictionary dynamic and sent by the service?

@rkistner (Contributor, Author)

> Would we re-use the dictionary across sync lines? Is the dictionary dynamic and sent by the service?

In short, yes. My initial idea is that compression would be used for the data, rather than for the protocol itself. This allows keeping the data compressed as-is from bucket storage -> sync stream (websocket) -> ps_oplog, and potentially even in the data tables (compressed data tables would be opt-in, since this would slow down queries).

If we compress the data of individual rows on their own, the compression ratio for small rows isn't great (around 0.6 of the original size in my tests). However, if we pre-train a dictionary on the same data, the compression ratio can be as good as 0.2 or 0.1, a 5-10x reduction in size.

This does have some implications:

  1. We'll need to train a separate compression dictionary (or multiple) for each bucket, to avoid leaking data between buckets/users.
  2. We'll need a way to indicate which dictionary is used for which bucket or operations.
  3. We'll need a way to get those dictionaries to the client.
  4. We'll need to manage those dictionaries on the client.

So overall the project becomes quite complex, but I don't think compression is worth it without using dictionaries.
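To make the dictionary side a bit more concrete: training would happen on the service, not in the core extension, but the same zstd library exposes a training API. A hedged sketch using the Rust `zstd` crate, where `bucket_rows` stands in for sample row payloads from a single bucket; the function name, the 16 KiB dictionary size, and compression level 3 are all illustrative.

```rust
use zstd::bulk::Compressor;
use zstd::dict;

/// Illustrative only: train a per-bucket dictionary from sample rows,
/// then compress the rows with it to check the ratio improvement.
fn train_and_compress(bucket_rows: &[Vec<u8>]) -> std::io::Result<()> {
    // Train a small dictionary (16 KiB here) on representative rows of one bucket.
    let dictionary = dict::from_samples(bucket_rows, 16 * 1024)?;

    let mut compressor = Compressor::with_dictionary(3, &dictionary)?;
    for row in bucket_rows {
        let compressed = compressor.compress(row)?;
        println!("{} -> {} bytes", row.len(), compressed.len());
    }
    Ok(())
}
```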

This PR is just one small piece of evaluating the feasibility of the project: Can we efficiently do decompression on the client?
