Right now data is transferred from Python to JS fully uncompressed (line 68 in 6a64c6f):

```python
feather.write_feather(table, bio, compression="uncompressed")
```
Uncompressed data is fine for local kernels, where Python and the browser are on the same machine, but not ideal for remote kernels, like JupyterHub or Colab, where Python is on a remote server and data has to be downloaded before it can be rendered on a map.
## Data compression options
There are a few options for data compression:
- Uncompressed (the current behavior).
- Apply a general-purpose compression like gzip to the entire table buffer. This is simple to implement on both the Python and JS sides, but quite slow.
- Apply compression inside the Arrow IPC format. The format supports only "light" compression (LZ4 or ZSTD) and no additional encodings like delta encoding for smaller file sizes. The bigger downside is that reading compressed IPC files is not currently supported by Arrow JS.
- Use Parquet. This gives the most efficient compression, but requires a WebAssembly-based parser on the JS side, and adding the Wasm could make the build setup more difficult.
## Different settings for local/remote?
Another question is whether the compression default could differ based on whether the Python session is local or remote. Ideally a local Python kernel would use no compression while a remote Python kernel would use the most efficient compression.
The problem is that because Jupyter follows a server-client model, I don't know of a good way to tell from Python whether the attached client is running locally or remotely. There are heuristics, like checking whether `"google.colab" in sys.modules`, but that only covers the Colab case.
So the best default seems to be fast, moderate-size compression, with a parameter that lets the user choose either no compression or slower, smallest-file-size compression.
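For illustration, a heuristic along those lines might look like the sketch below. The function name is hypothetical, and the JupyterHub environment-variable check is an assumption about typical deployments, not a guaranteed signal:

```python
import os
import sys


def guess_remote_kernel() -> bool:
    """Best-effort guess at whether this kernel is remote (hypothetical helper)."""
    # Colab imports google.colab inside the kernel process.
    if "google.colab" in sys.modules:
        return True
    # JupyterHub single-user servers typically set JUPYTERHUB_* variables
    # (assumption: these propagate into the kernel's environment).
    if any(key.startswith("JUPYTERHUB_") for key in os.environ):
        return True
    # No general signal exists: the kernel can't see where the browser runs.
    return False
```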
## Unscientific benchmarks
Unscientific benchmarks using the Utah dataset of 1 million buildings (7M coordinates):
| Compression Type | File size | Write time |
|---|---|---|
| Feather (uncompressed) | 144 MB | 17 ms |
| gzip full-buffer compression | 64 MB | 13 s |
| Feather (ZSTD) | 80 MB | 200 ms |
| Feather (LZ4) | 97 MB | 147 ms |
| Parquet (Snappy) | 82 MB | 444 ms |
| Parquet (gzip) | 60 MB | 4.5 s |
| Parquet (brotli) | 45 MB | 3.7 s |
| Parquet (ZSTD) | 74 MB | 466 ms |
| Parquet (ZSTD level 22) | 41.6 MB | 11 s |
| Parquet (ZSTD level 18) | 41.6 MB | 9.8 s |
| Parquet (ZSTD level 16) | 48.3 MB | 5.7 s |
| Parquet (ZSTD level 14) | 49.8 MB | 2.7 s |
| Parquet (ZSTD level 12) | 49.8 MB | 1.9 s |
| Parquet (ZSTD level 10) | 49.8 MB | 1.7 s |
| Parquet (ZSTD level 8) | 50.3 MB | 1.4 s |
| Parquet (ZSTD level 7) | 50.3 MB | 1.25 s |
| Parquet (ZSTD level 6) | 51.4 MB | 1.2 s |
| Parquet (ZSTD level 4) | 57.8 MB | 800 ms |
| Parquet (ZSTD level 2) | 69.1 MB | 560 ms |
Given this, Parquet with ZSTD around level 7 has a very good combination of write speed and file size, and likely makes sense as a default.