[Python][Parquet] Support LZ4_RAW for parquet writing #41863

@douglas-raillard-arm

Describe the enhancement requested

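Writing a Parquet dataset with compression='lz4_raw', passed through pyarrow.dataset.ParquetFileFormat().make_write_options() for use with pyarrow.dataset.write_dataset(), currently fails with: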

Traceback (most recent call last):
  File "/work/projects/lisa/testpyarrow.py", line 3, in <module>
    _reencode_parquet('sched_switch.lz4.parquet', 'updated.parquet', compression='lz4_raw')#, row_group_size=128*1024*1024, compression='LZ4')
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "x.py", line 1, in my_write_parquet
    options = pyarrow.dataset.ParquetFileFormat().make_write_options(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset_parquet.pyx", line 206, in pyarrow._dataset_parquet.ParquetFileFormat.make_write_options
  File "pyarrow/_dataset_parquet.pyx", line 594, in pyarrow._dataset_parquet.ParquetFileWriteOptions.update
  File "pyarrow/_dataset_parquet.pyx", line 599, in pyarrow._dataset_parquet.ParquetFileWriteOptions._set_properties
  File "pyarrow/_parquet.pyx", line 1855, in pyarrow._parquet._create_writer_properties
  File "pyarrow/_parquet.pyx", line 1369, in pyarrow._parquet.check_compression_name
pyarrow.lib.ArrowException: Unsupported compression: lz4_raw

And indeed, no mention of lz4_raw is to be found in python/pyarrow/_parquet.pyx.
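
The traceback ends in check_compression_name, which suggests the codec name is validated against a fixed set of names before any writer properties are built. A minimal pure-Python sketch of that kind of check follows. The set contents, the function name check_compression_name_sketch, and the use of ValueError are assumptions made for illustration; this is not copied from pyarrow and the real implementation may differ.

```python
# Hypothetical allow-list of codec names; the actual set used by pyarrow
# is an assumption here and may differ between versions.
ASSUMED_SUPPORTED_CODECS = frozenset(
    {"NONE", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD"}
)


def check_compression_name_sketch(name: str) -> str:
    """Return the upper-cased codec name, or raise if it is not allowed.

    Illustrative stand-in for a name validation step; it is not the
    pyarrow function of a similar name.
    """
    normalized = name.upper()
    if normalized not in ASSUMED_SUPPORTED_CODECS:
        # Same message shape as the error reported in the traceback.
        raise ValueError(f"Unsupported compression: {name}")
    return normalized
```

Under this assumption, 'lz4' passes the check while 'lz4_raw' is rejected before any file is written, which matches the failure reported here. Supporting the codec on the Python side would then presumably require at least accepting the new name in such a check and mapping it to the matching codec in the underlying writer.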

Would it be possible to add support for the LZ4_RAW codec when writing Parquet files, particularly through the dataset API?

Component(s)

Python
