Skip to content

Conversation

@greedAuguria
Copy link

Which issue does this PR close?

Rationale for this change

Currently, when DataFusion parses Hive-style partitioned paths (e.g., s3://bucket/table/city=San%20Francisco/), it extracts the partition value literally as San%20Francisco.

Standard practice in tools like Apache Spark and Apache Hive is to URL-encode special characters in partition values when writing to object stores. This PR ensures DataFusion correctly decodes these values (to San Francisco) during the listing process, preventing data mismatches and ensuring consistent behavior with other engines.

What changes are included in this PR?

  1. Logic Update: Updated parse_partitions_for_path in datafusion/catalog-listing/src/helpers.rs to percent-decode partition values.
  2. Signature Change: Changed the return type of parse_partitions_for_path from Option<Vec<&str>> to Option<Vec<String>> because decoded strings require new allocations.
  3. Call Site Updates: Updated internal callers in datafusion-catalog-listing to accommodate the owned String return type.
  4. Dependencies: Added percent-encoding crate to datafusion-catalog-listing.

Are these changes tested?

Yes, new unit tests were added to datafusion/catalog-listing/src/helpers.rs covering:

  • Standard URL-encoded characters (e.g., %2F for /).
  • Spaces encoded as %20.
  • Multi-byte UTF-8 characters.
  • Forgiving behavior for malformed encoding (matching percent-encoding crate defaults).

Are there any user-facing changes?

Yes:

  • Data Behavior: Partition values extracted from file paths will now be correctly decoded instead of remaining URL-encoded.
  • API Change: The public helper function parse_partitions_for_path now returns Vec<String> instead of Vec<&str>.

@github-actions github-actions bot added the catalog Related to the catalog crate label Jan 5, 2026
file_path: &'a Path,
table_partition_cols: I,
) -> Option<Vec<&'a str>>
) -> Option<Vec<String>>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Not sure whether it is worth it to return a Vec<Cow<'a, str>> here ?! E.g. if the partition contains % then decode it and return Cow::Owned(decoded.into_owned()), otherwise `Cow::Borrowed(val).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

match part.split_once('=') {
Some((name, val)) if name == expected_partition => part_values.push(val),
Some((name, val)) if name == expected_partition => {
let decoded = percent_decode_str(val).decode_utf8().ok()?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should invalid values (like %FF in one of the tests below) return None or val ?!
I think it should behave like %XX - i.e. return val

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

catalog Related to the catalog crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Partition values are not URL-decoded when extracted from Hive-style paths

2 participants