I haven't looked at the code yet, but I am glad that this demonstration is here!
I have a few things to say up front:
- the comparison of on-disk size should be made against highly Zstd-compressed JSON, since the parquet data is compressed internally and Zstd adds little overhead when reading. Also, make sure you have ujson installed for a fair comparison: it is much faster than the standard json module, and referenceFS will use it when available.
- loading time and file size can probably both be improved by setting offset/size to 0 rather than NULL where there is embedded data; the same may be true of loading the string columns as string[pyarrow] (NB: you don't specify which parquet engine to use). That column type is not yet supported by fastparquet, but it would be far more efficient in terms of memory usage.
- you mention the future with lazy loading and partitioning, and I think these will be the killer features. Remember, we could structure the parquet data files to contain only specific subsets of key prefixes, or even use the path names of the data files to encode part of the keys. This is necessary because reference sets will easily become too big to fit into memory; but of course we then need an extra indirection and have to cache parts of the dataframe.
- do we have a benchmark of df.loc versus dict[item]? What about when the df is sorted on the index? Is pandas actually the most efficient data structure for this?
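To illustrate the offset/size point above, here is a minimal sketch (column names and the embedded-data convention are my assumptions, not the actual schema) of why 0 beats NULL in pandas:

```python
import pandas as pd

# Hypothetical reference table: a row either points into a remote file
# (path/offset/size) or carries raw embedded bytes. Column names are
# illustrative, not the actual schema.
df = pd.DataFrame({
    "path": ["s3://bucket/file", None],  # None marks embedded data
    "offset": [4096, None],
    "size": [4096, None],
    "raw": [None, b"inline-bytes"],
})
print(df.dtypes)  # NULLs force offset/size into float64

# Using 0 instead of NULL lets the columns stay plain int64, which is
# smaller on disk and cheaper to load.
df[["offset", "size"]] = df[["offset", "size"]].fillna(0).astype("int64")
print(df.dtypes)
```

The same reasoning applies on the parquet side: non-nullable integer columns compress and decode more cheaply than nullable ones.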
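On the prefix-partitioning idea, a rough sketch of the shape I mean (the file-naming scheme is hypothetical, and an in-memory dict stands in for actual parquet files):

```python
from functools import lru_cache

# Sketch: route each reference key to a per-prefix parquet file, so only
# the partitions a workload touches get loaded. Naming is hypothetical.
def partition_for(key: str) -> str:
    prefix = key.split("/", 1)[0]  # first key component picks the file
    return f"refs/{prefix}.parquet"

# Stand-in for pd.read_parquet(path); a real version would hit storage.
FILES = {
    "refs/temp.parquet": {"temp/0.0": ("s3://bucket/a", 0, 4096)},
    "refs/salt.parquet": {"salt/0.0": ("s3://bucket/b", 0, 4096)},
}

@lru_cache(maxsize=8)  # the "extra indirection and cache" part
def load_partition(path: str) -> dict:
    return FILES[path]

def lookup(key: str):
    return load_partition(partition_for(key))[key]

print(lookup("temp/0.0"))  # -> ('s3://bucket/a', 0, 4096)
```

The cache size then bounds memory regardless of how large the full reference set grows, at the cost of a partition load on a miss.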
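On the df.loc versus dict[item] question, a quick micro-benchmark sketch (key format and sizes are made up) could look like:

```python
import timeit
import pandas as pd

# Toy reference set: key -> (path, offset, size). Key format and sizes
# are made up for illustration, not taken from any real store.
n = 100_000
keys = [f"var/{i // 100}.{i % 100}" for i in range(n)]
refs = {k: ("s3://bucket/file", i * 4096, 4096) for i, k in enumerate(keys)}
df = pd.DataFrame(
    {
        "path": "s3://bucket/file",
        "offset": [i * 4096 for i in range(n)],
        "size": 4096,
    },
    index=keys,
)
df_sorted = df.sort_index()  # does a sorted index change .loc latency?

key = keys[n // 2]
results = {}
for label, fn in [
    ("dict[item]", lambda: refs[key]),
    ("df.loc", lambda: df.loc[key]),
    ("df.loc sorted", lambda: df_sorted.loc[key]),
]:
    results[label] = timeit.timeit(fn, number=2_000)
    print(f"{label:14s} {results[label]:.4f}s")
```

In rough runs of sketches like this, the plain dict lookup tends to win by a wide margin, since scalar .loc goes through the full indexing machinery and materializes a Series; whether that matters depends on how many lookups per open a workload actually does.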
How much of an ask is it to upstream this into referenceFS? Should it be a different implementation when it shares everything except the reference lookups?