[Data] Add Polars batch format support to map_batches #58896

soffer-anyscale · 2025-11-21T19:42:32Z

Description

Adds support for batch_format="polars" to ray.data.Dataset.map_batches(), allowing users to work with Polars DataFrames as a batch format alongside the existing numpy, pandas, and pyarrow formats.

Polars is a fast DataFrame library written in Rust with a Python API that provides significant performance improvements over pandas for many operations. As Polars adoption grows in the open-source data processing ecosystem, users increasingly want to use it in their Ray Data pipelines. This change enables seamless integration of Polars into Ray Data workflows, allowing users to leverage Polars' optimized query engine and expressive API when processing batches.

Related issues

N/A

Additional information

Why Polars Support?

Performance: Polars provides 10-100x speedups over pandas for many operations due to its Rust-based implementation and query optimization
Growing adoption: Polars has seen rapid adoption in the Python data science community as a modern alternative to pandas
Better ergonomics: Polars' lazy evaluation and expression API provide a more intuitive interface for complex data transformations
Memory efficiency: Polars' columnar architecture and zero-copy reads can reduce memory usage compared to pandas

Changes

Add 'polars' to VALID_BATCH_FORMATS
Implement to_polars() methods in ArrowBlockAccessor and PandasBlockAccessor
Add batch_to_block_from_polars() for converting Polars DataFrames to blocks
Update _validate_batch_output() to accept Polars DataFrames
Add comprehensive validation for LazyFrame, Series, and invalid types
Update documentation with Polars examples and performance notes
Add tests for Polars format in map_batches, iter_batches, and take_batch
Include URLs to Polars documentation in docstrings
Document that Polars format always creates copies (no zero-copy support)
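
The first item in the list above amounts to widening an allow-list that is checked before any conversion happens. A minimal sketch of that idea, not the PR's code: only the name VALID_BATCH_FORMATS and the four format strings come from this description, while validate_batch_format is a hypothetical helper, and the real constant may hold additional entries.

```python
# Format names taken from the PR description; the real constant in
# python/ray/data/block.py may contain additional entries (assumption).
VALID_BATCH_FORMATS = ["pandas", "pyarrow", "numpy", "polars"]


def validate_batch_format(batch_format: str) -> str:
    """Hypothetical helper: reject unknown batch formats before converting."""
    if batch_format not in VALID_BATCH_FORMATS:
        raise ValueError(
            f"Unsupported batch_format {batch_format!r}; "
            f"expected one of {VALID_BATCH_FORMATS}."
        )
    return batch_format
```

With this shape, validate_batch_format("polars") passes, while an unknown name such as "dask" fails fast with a ValueError instead of failing later inside a conversion routine.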

Usage Example

import ray
import polars as pl

def process_batch(batch: pl.DataFrame) -> pl.DataFrame:
    # ray.data.range() produces a single "id" column, so that is the column to double.
    return batch.with_columns([
        (pl.col("id") * 2).alias("doubled")
    ])

ds = ray.data.range(100)
result = ds.map_batches(process_batch, batch_format="polars")

Performance Notes

Polars format conversions always create copies of data (zero-copy is not possible). This may result in 2-3x memory usage compared to Arrow format due to input and output conversions. For large datasets, consider using batch_format="pyarrow" for better memory efficiency.

Signed-off-by: soffer-anyscale <[email protected]>

Copilot

Pull request overview

This PR adds support for Polars DataFrames as a batch format in Ray Data's map_batches() API, enabling users to leverage Polars' performance-optimized DataFrame library alongside existing formats (numpy, pandas, pyarrow).

Key Changes:

Added "polars" as a valid batch format option in Ray Data APIs
Implemented conversion methods (to_polars()) in block accessors to transform blocks to Polars DataFrames
Added validation logic for Polars-specific types (LazyFrame, Series) with helpful error messages

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

File	Description
`python/ray/data/tests/test_map_batches.py`	Added tests for Polars batch format in `map_batches()` including timestamp handling
`python/ray/data/tests/test_consumption.py`	Added assertions to verify Polars DataFrame type in batch format tests
`python/ray/data/tests/block_batching/test_util.py`	Extended batch format test parameterization to include Polars
`python/ray/data/dataset.py`	Updated docstrings to document Polars format option and memory behavior
`python/ray/data/block.py`	Added Polars support to batch format constants, types, and conversion logic
`python/ray/data/_internal/planner/plan_udf_map_op.py`	Added validation for Polars LazyFrame and DataFrame in UDF outputs
`python/ray/data/_internal/pandas_block.py`	Implemented `to_polars()` method for converting Pandas blocks to Polars DataFrames
`python/ray/data/_internal/arrow_block.py`	Implemented `to_polars()` method for converting Arrow blocks to Polars DataFrames
`doc/source/data/transforming-data.rst`	Added Polars examples and updated documentation on batch formats
`doc/source/data/iterating-over-data.rst`	Added Polars example for iterating over batches
`doc/source/data/inspecting-data.rst`	Added Polars example for `take_batch()`

Copilot · 2025-11-21T19:44:05Z

python/ray/data/_internal/arrow_block.py

+        # Polars works better with contiguous arrays
+        from ray.data._internal.arrow_ops import transform_pyarrow
+
+        combined_table = transform_pyarrow.combine_chunks(self._table, copy=False)


The comment states 'This conversion creates a copy of the data', but the code uses copy=False in combine_chunks(). This is inconsistent and may confuse developers. Either update the comment to clarify that combine_chunks doesn't copy but the subsequent Polars conversion does, or explain the complete copy behavior more accurately.

Copilot · 2025-11-21T19:44:05Z

python/ray/data/block.py

+        """Create a block from a Polars DataFrame.
+
+        Converts a Polars DataFrame to an Arrow Table or Pandas DataFrame block.
+        See https://docs.pola.rs/ for Polars documentation.


Corrected URL typo from 'pola.rs' to 'pola.rs' - should be 'docs.pola.rs' or 'pola-rs.github.io/polars'.

Copilot · 2025-11-21T19:44:06Z

python/ray/data/_internal/pandas_block.py

+
+        # Convert to Polars DataFrame using from_pandas()
+        # See https://docs.pola.rs/api/dataframe/#polars.DataFrame.from_pandas
+        return pl.from_pandas(self._table)


[nitpick] The docstring states 'This conversion creates a copy of the data', but doesn't explain why zero-copy is impossible. Consider adding a brief explanation (e.g., 'due to differences in memory layout between Pandas and Polars') to help developers understand the limitation.

gemini-code-assist

Code Review

This pull request adds support for batch_format="polars" to ray.data.map_batches(), which is a great enhancement for users of the Polars library. The changes are comprehensive, covering documentation, new conversion methods, validation logic, and tests. My review identified a few areas for improvement, including some inefficient or redundant code, and a bug in a deprecated function that was updated. Overall, this is a solid contribution that will improve the Ray Data ecosystem.

gemini-code-assist · 2025-11-21T19:45:00Z

python/ray/data/dataset.py

        """
        # Check that batch_format
-        accepted_batch_formats = ["pandas", "pyarrow", "numpy"]
+        accepted_batch_formats = ["pandas", "pyarrow", "numpy", "polars"]


Adding "polars" to accepted_batch_formats for the deprecated add_column method will cause a runtime error. The add_column implementation does not handle the "polars" batch format and will fall through to the else block, which assumes a NumPy batch format, leading to an AssertionError.

Since add_column is deprecated, it's probably best to not add Polars support to it. Please remove "polars" from accepted_batch_formats to avoid this bug.

Suggested change

-        accepted_batch_formats = ["pandas", "pyarrow", "numpy", "polars"]
+        accepted_batch_formats = ["pandas", "pyarrow", "numpy"]
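
The failure mode described in this comment can be reproduced in isolation. The sketch below is a hypothetical stand-in, not the add_column implementation: it only assumes, as the comment states, that the format gate accepts "polars" while the dispatch has no Polars branch and its final else assumes a NumPy-style batch.

```python
def add_column_sketch(batch, batch_format: str) -> str:
    """Hypothetical stand-in for a format dispatch with a NumPy fallback."""
    accepted_batch_formats = ["pandas", "pyarrow", "numpy", "polars"]
    if batch_format not in accepted_batch_formats:
        raise ValueError(f"batch_format must be one of {accepted_batch_formats}")

    if batch_format == "pandas":
        return "handled as pandas"
    elif batch_format == "pyarrow":
        return "handled as pyarrow"
    else:
        # The fallback assumes a NumPy batch (modeled here as a dict of arrays),
        # so a "polars" batch that passed the gate above fails this assertion.
        assert isinstance(batch, dict), "expected a NumPy-style dict batch"
        return "handled as numpy"
```

Calling add_column_sketch(object(), "polars") passes the gate and then raises AssertionError in the fallback. Removing "polars" from the accepted list turns that into an immediate, descriptive ValueError.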

gemini-code-assist · 2025-11-21T19:45:00Z

python/ray/data/_internal/pandas_block.py

+        if len(self._table.columns) != len(set(self._table.columns)):
+            duplicates = [
+                col for col in self._table.columns
+                if list(self._table.columns).count(col) > 1
+            ]
+            raise ValueError(
+                f"Pandas DataFrame has duplicate column names: {duplicates}. "
+                "Rename duplicate columns before converting to Polars."
+            )


This check for duplicate columns is a bit inefficient. A more idiomatic and performant way to check for and find duplicates in a pandas Index is to use self._table.columns.is_unique and self._table.columns.duplicated().

        if not self._table.columns.is_unique:
            duplicates = self._table.columns[self._table.columns.duplicated()].unique().tolist()
            raise ValueError(
                f"Pandas DataFrame has duplicate column names: {duplicates}. "
                "Rename duplicate columns before converting to Polars."
            )
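
Beyond speed, the two approaches also report duplicates differently. The following plain-Python analogue (standard library only, with hypothetical function names; it does not use the pandas Index API from the suggestion) shows that the count-based comprehension lists a duplicated name once per occurrence, while a single-pass tally lists it once.

```python
from collections import Counter


def duplicates_via_count(columns):
    # Mirrors the list comprehension in the diff: quadratic in the number of
    # columns, and every occurrence of a duplicated name is reported.
    return [col for col in columns if list(columns).count(col) > 1]


def duplicates_via_counter(columns):
    # Single pass; each duplicated name is reported once, in first-seen order.
    return [name for name, n in Counter(columns).items() if n > 1]


columns = ["a", "b", "a", "c", "b"]
print(duplicates_via_count(columns))    # ['a', 'b', 'a', 'b']
print(duplicates_via_counter(columns))  # ['a', 'b']
```

The deduplicated form also produces a shorter, clearer error message when it is interpolated into the ValueError text.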

gemini-code-assist · 2025-11-21T19:45:00Z

python/ray/data/_internal/planner/plan_udf_map_op.py

+                            if isinstance(out_batch, pl.LazyFrame):
+                                raise ValueError(
+                                    "Generator yielded a Polars LazyFrame. "
+                                    "All yielded frames must be materialized. "
+                                    "Call .collect() on LazyFrames before yielding. "
+                                    "See https://docs.pola.rs/api/lazyframe/#collect for details."
+                                )
+                            elif isinstance(out_batch, pl.DataFrame):
+                                # DataFrame is always eager, but verify it's valid
+                                try:
+                                    # Access schema to ensure DataFrame is valid
+                                    _ = out_batch.schema
+                                except Exception as e:
+                                    raise ValueError(
+                                        f"Polars DataFrame is in invalid state: {e}. "
+                                        "Ensure the DataFrame is properly constructed."
+                                    ) from e


The check for pl.LazyFrame here is redundant. The _validate_batch_output function, called on line 538, already checks if the output is a pl.LazyFrame and raises an appropriate ValueError. You can remove the isinstance(out_batch, pl.LazyFrame) block and convert the elif to an if to avoid duplicated logic.

Suggested change

-                            if isinstance(out_batch, pl.LazyFrame):
-                                raise ValueError(
-                                    "Generator yielded a Polars LazyFrame. "
-                                    "All yielded frames must be materialized. "
-                                    "Call .collect() on LazyFrames before yielding. "
-                                    "See https://docs.pola.rs/api/lazyframe/#collect for details."
-                                )
-                            elif isinstance(out_batch, pl.DataFrame):
-                                # DataFrame is always eager, but verify it's valid
-                                try:
-                                    # Access schema to ensure DataFrame is valid
-                                    _ = out_batch.schema
-                                except Exception as e:
-                                    raise ValueError(
-                                        f"Polars DataFrame is in invalid state: {e}. "
-                                        "Ensure the DataFrame is properly constructed."
-                                    ) from e
+                            if isinstance(out_batch, pl.DataFrame):
+                                # DataFrame is always eager, but verify it's valid
+                                try:
+                                    # Access schema to ensure DataFrame is valid
+                                    _ = out_batch.schema
+                                except Exception as e:
+                                    raise ValueError(
+                                        f"Polars DataFrame is in invalid state: {e}. "
+                                        "Ensure the DataFrame is properly constructed."
+                                    ) from e
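
The structure this suggestion aims for is a single validation function that every yielded batch passes through, so the generator path needs no LazyFrame check of its own. A minimal sketch under stated assumptions: the two classes are hypothetical stand-ins for polars.LazyFrame and polars.DataFrame (so the example runs without Polars installed), and validate_batch_output is a simplified imitation of the _validate_batch_output function named in the comment, not its actual body.

```python
class LazyFrameStandIn:
    """Hypothetical stand-in for polars.LazyFrame."""


class DataFrameStandIn:
    """Hypothetical stand-in for polars.DataFrame."""


def validate_batch_output(batch) -> None:
    """Single place that rejects lazy outputs (simplified imitation)."""
    if isinstance(batch, LazyFrameStandIn):
        raise ValueError(
            "UDF returned a LazyFrame; call .collect() before returning or yielding."
        )


def validated(outputs):
    """Yield each UDF output after running the shared validation once."""
    for out_batch in outputs:
        validate_batch_output(out_batch)  # no second LazyFrame check needed here
        yield out_batch
```

Because the check lives in one place, the error message for lazy outputs stays identical whether the UDF returns a batch or yields several.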

gemini-code-assist · 2025-11-21T19:45:00Z

python/ray/data/block.py

+        if len(batch.columns) != len(set(batch.columns)):
+            duplicates = [
+                col for col in batch.columns if batch.columns.count(col) > 1
+            ]
+            raise ValueError(
+                f"Polars DataFrame has duplicate column names: {duplicates}. "
+                "Rename duplicate columns before converting."
+            )


This check for duplicate columns appears to be redundant. Polars DataFrames do not allow duplicate column names, and attempting to create one will raise a DuplicateError. This validation can be removed.

soffer-anyscale requested a review from a team as a code owner November 21, 2025 19:42

soffer-anyscale requested a review from Copilot November 21, 2025 19:43

Copilot AI reviewed Nov 21, 2025

gemini-code-assist bot reviewed Nov 21, 2025

ray-gardener bot added the docs (An issue or change related to documentation) and data (Ray Data-related issues) labels Nov 22, 2025

2 participants
