Data compression over the wire #37

@kylebarron

Right now data is transferred from Python to JS fully uncompressed:

```python
import pyarrow.feather as feather

feather.write_feather(table, bio, compression="uncompressed")
```

Uncompressed data is fine for local kernels, where Python and the browser are on the same machine, but not ideal for remote kernels, like JupyterHub or Colab, where Python is on a remote server and data has to be downloaded before it can be rendered on a map.

Data compression options

There are a few options for data compression:

  • Uncompressed.
  • Apply a simple compression like gzip to the entire table buffer. This is simple to implement on both the Python and JS sides, but quite slow.
  • Apply compression within the Arrow IPC format. The IPC file format supports only "light" compression (LZ4 or ZSTD) and doesn't apply any other encodings, such as delta encoding, for smaller file size. The downside is that reading compressed IPC files is not currently supported by Arrow JS. (This option and the gzip option are sketched in code after this list.)
  • Use Parquet. This has the most efficient compression, but with the downside of requiring a WebAssembly-based parser on the JS side; adding the Wasm could make the build setup more difficult.
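
For concreteness, here's a minimal sketch of options 2 and 3 from the Python side. The example table is a stand-in for the real data; `feather.write_feather` with Feather v2 writes the Arrow IPC file format.

```python
import gzip
import io

import pyarrow as pa
import pyarrow.feather as feather

# Stand-in table; the real payload is the table being sent to JS.
table = pa.table({"x": pa.array(range(1_000_000), type=pa.float64())})

# Option 2: gzip the entire uncompressed IPC buffer.
bio = io.BytesIO()
feather.write_feather(table, bio, compression="uncompressed")
gzipped = gzip.compress(bio.getvalue())

# Option 3: let the IPC format compress each buffer internally.
# "zstd" and "lz4" are the only codecs the IPC spec allows.
bio = io.BytesIO()
feather.write_feather(table, bio, compression="zstd")
```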

Different settings for local/remote?

Another question is whether it's possible to have different compression defaults based on whether the Python session is local or remote. Ideally a local Python kernel could use no compression while a remote Python kernel could use the most efficient compression.

The problem is that because Jupyter follows a server-client model, I don't know of a good way to tell from Python whether the attached client is running locally or remotely. There could be heuristics, like checking whether "google.colab" is in sys.modules, but that's only valid in the Colab case.
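
To illustrate, such a heuristic might look like the sketch below. The function name is hypothetical; only the Colab check is grounded in anything reliable.

```python
import sys

def is_probably_remote() -> bool:
    """Hypothetical best-effort guess at whether the kernel is remote."""
    # Colab is detectable because it injects its own module.
    if "google.colab" in sys.modules:
        return True
    # A local JupyterLab kernel and a remote JupyterHub kernel look
    # identical from the Python process, so default to "local".
    return False
```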

So it seems like the best default would be fast, moderate-size compression, with a parameter that lets the user choose either no compression or slower, smallest-file-size compression.
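
In other words, the user-facing API could look something like the sketch below, where `serialize_table` and its defaults are hypothetical, not an existing function:

```python
import io
from typing import Literal

import pyarrow as pa
import pyarrow.feather as feather

def serialize_table(
    table: pa.Table,
    compression: Literal["uncompressed", "lz4", "zstd"] = "zstd",
) -> bytes:
    """Hypothetical serializer: fast, moderate compression by default,
    with escape hatches to no compression or a denser codec."""
    bio = io.BytesIO()
    feather.write_feather(table, bio, compression=compression)
    return bio.getvalue()
```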

Unscientific benchmarks

Unscientific benchmarks using the Utah dataset of 1 million buildings (7M coordinates):

| Compression type | File size | Write time |
| --- | --- | --- |
| Feather (uncompressed) | 144 MB | 17 ms |
| gzip full-buffer compression | 64 MB | 13 s |
| Feather (ZSTD) | 80 MB | 200 ms |
| Feather (LZ4) | 97 MB | 147 ms |
| Parquet (Snappy) | 82 MB | 444 ms |
| Parquet (gzip) | 60 MB | 4.5 s |
| Parquet (brotli) | 45 MB | 3.7 s |
| Parquet (ZSTD) | 74 MB | 466 ms |
| Parquet (ZSTD level 22) | 41.6 MB | 11 s |
| Parquet (ZSTD level 18) | 41.6 MB | 9.8 s |
| Parquet (ZSTD level 16) | 48.3 MB | 5.7 s |
| Parquet (ZSTD level 14) | 49.8 MB | 2.7 s |
| Parquet (ZSTD level 12) | 49.8 MB | 1.9 s |
| Parquet (ZSTD level 10) | 49.8 MB | 1.7 s |
| Parquet (ZSTD level 8) | 50.3 MB | 1.4 s |
| Parquet (ZSTD level 7) | 50.3 MB | 1.25 s |
| Parquet (ZSTD level 6) | 51.4 MB | 1.2 s |
| Parquet (ZSTD level 4) | 57.8 MB | 800 ms |
| Parquet (ZSTD level 2) | 69.1 MB | 560 ms |

Given this, ZSTD at around level 7 seems to offer a very good combination of write speed and file size, and likely makes sense as a default.
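
As a sketch, that default would look like the following with pyarrow's Parquet writer (the table is again a stand-in for the real dataset):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in table; the benchmark above used ~7M coordinates.
table = pa.table({"x": pa.array(range(1_000_000), type=pa.float64())})

bio = io.BytesIO()
# ZSTD level 7: ~50 MB and ~1.25 s in the benchmark above.
pq.write_table(table, bio, compression="zstd", compression_level=7)
```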
