
[arrow-avro] Add Explicit Projection API to ReaderBuilder #8923

@jecsand838

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As work has progressed on integrating arrow-avro into Apache DataFusion’s Avro datasource, we've come across complications and potential blockers related to projection pushdown.

Currently, the arrow-avro ReaderBuilder handles projection via the schema resolution process; however, this forces callers to implement complex, redundant logic for what should be a simple operation: deriving a projected reader schema from the writer schema/SchemaRef and aligning schema metadata. Further complexity falls on the caller when they need a one-off projection against a pre-defined reader schema stored externally for other schema resolution needs. In that scenario, they must pre-process the reader schema to drop fields while preserving the other intended schema resolution behaviors (promotion, renaming/reordering, etc.).

While the current arrow-avro implementation technically supports projection indirectly, and the DataFusion Avro datasource work shouldn't be blocked, every available workaround is unnecessarily complex for the caller and far from ideal.

Describe the solution you'd like

I’d like arrow-avro’s ReaderBuilder to gain an explicit projection API that:

  1. Adds a projection method on ReaderBuilder
    Something along the lines of:

    impl ReaderBuilder {
        /// Set the column projection for reading.
        ///
        /// `projection` is a list of zero-based top-level field indices
        /// (matching the Arrow/Avro field order).
        pub fn with_projection(self, projection: Vec<usize>) -> Self { ... }
    }

    The semantics would mirror the CSV ReaderBuilder::with_projection API (which already uses a Vec<usize> of column indices). A usage sketch follows this list.

  2. If no reader schema is provided, derive a reader schema from the writer schema using the projection.
    When reader_schema is None and projection is Some(indices), ReaderBuilder::build / ReaderBuilder::build_decoder would:

    • (ReaderBuilder::build): Read the OCF header and obtain the writer schema (as it already does).
    • (ReaderBuilder::build_decoder): Obtain the writer schemas from the provided writer schema registry.
    • Construct an Avro reader schema that:
      • (OCF) Uses the same root record name, namespace, and other Avro‑specific metadata from the writer schema via the existing AVRO_NAME_METADATA_KEY, AVRO_NAMESPACE_METADATA_KEY, etc.
      • (SOE) Similar to OCF, with the added caveat that the derived reader schema must also be compatible with each writer schema.
      • Keeps only the top-level fields indicated by the projection indices, in the projected order.
    • Use that derived Avro schema internally as reader_schema when constructing the RecordDecoder / Reader.

    This keeps all Avro-specific details anchored to the real writer schema (from the OCF header, or from the writer schema registry) while still allowing efficient columnar projection of Avro-encoded data.

  3. If a reader schema is provided, prune it to match the projection.
    When reader_schema is Some(schema) and projection is Some(indices), ReaderBuilder::build / ReaderBuilder::build_decoder would:

    • Parse the provided reader schema into the internal Avro Schema representation.
    • Prune its top-level record fields so that only those referenced by the projection remain (preserving existing Avro metadata for those fields).
    • Use the pruned schema for decoding.

    This lets callers continue to specify a reader schema for evolution / type overrides, while still applying an additional column projection in a single place. The projection is always interpreted relative to the Arrow/Avro top-level field ordering (consistent with other Arrow readers), and the OCF and SOE paths would behave essentially the same. A sketch of this pruning logic follows the list below.

  4. Scope and compatibility

    • If projection is None, behavior remains exactly as today, preserving backward compatibility.
    • The goal is to centralize Avro‑aware projection logic in arrow-avro, so downstream projects (like DataFusion) don’t need to reimplement Avro schema editing or risk diverging from arrow-avro’s resolution rules.
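
To make the proposed shape concrete, here is a minimal usage sketch against an OCF file. This is hypothetical: with_projection does not exist yet, and the ReaderBuilder::new / build / iterator-style Reader surface is assumed to match the current arrow-avro API.

    use std::fs::File;
    use std::io::BufReader;

    use arrow_avro::reader::ReaderBuilder;

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let file = BufReader::new(File::open("events.avro")?);

        // Hypothetical: decode only the first and third top-level fields,
        // in that order. With no reader schema provided, the builder would
        // derive one from the OCF writer schema (item 2 above).
        let reader = ReaderBuilder::new()
            .with_projection(vec![0, 2]) // proposed method
            .build(file)?;

        for batch in reader {
            let batch = batch?;
            assert_eq!(batch.num_columns(), 2); // only projected columns decoded
        }
        Ok(())
    }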
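
Under the hood, items 2 and 3 reduce to the same transformation: keep only the projected top-level fields of the root record, in projection order, while leaving the root name, namespace, and per-field definitions untouched. The sketch below illustrates that transformation over raw Avro schema JSON using serde_json; the real implementation would operate on arrow-avro's internal Schema representation instead.

    use serde_json::Value;

    /// Illustration only: given the JSON of an Avro root record schema,
    /// keep the top-level fields selected by `projection` (zero-based
    /// indices), in projected order. The root `name`, `namespace`,
    /// `aliases`, and each retained field's definition are preserved as-is.
    fn project_record_schema(
        schema_json: &str,
        projection: &[usize],
    ) -> Result<String, Box<dyn std::error::Error>> {
        let mut root: Value = serde_json::from_str(schema_json)?;
        let fields = root
            .get("fields")
            .and_then(Value::as_array)
            .ok_or("root schema is not a record")?
            .clone();
        let projected: Vec<Value> = projection
            .iter()
            .map(|&i| fields.get(i).cloned().ok_or("projection index out of bounds"))
            .collect::<Result<_, _>>()?;
        root["fields"] = Value::Array(projected);
        Ok(root.to_string())
    }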

Describe alternatives you've considered

  1. Constructing a projected Avro reader schema in DataFusion itself.

    DataFusion could try to:

    • Persist the original Avro schema JSON alongside the Arrow schema,
    • Apply projection to that JSON in its own code (i.e., pruning fields), and
    • Pass the result to ReaderBuilder::with_reader_schema.

    However, this duplicates logic that already conceptually belongs to arrow-avro (schema resolution + projection). It would be easy for a downstream implementation to diverge subtly from arrow-avro’s behavior, especially around name resolution and metadata like avro.name and avro.namespace.

  2. Reapplying Avro metadata (record name/namespace) after using AvroSchema::try_from(ArrowSchema).

    Another option is to let DataFusion continue converting its projected Arrow schema back into an AvroSchema, then manually patch in the correct root record name and namespace via metadata constants like AVRO_NAME_METADATA_KEY and AVRO_NAMESPACE_METADATA_KEY (sketched after this list).

    This is fragile for a couple of reasons:

    • It requires DataFusion to either track and re-inject the original writer schema’s naming information or reconstruct it heuristically.
    • It still doesn’t leverage any internal arrow-avro machinery for resolving writer vs reader schemas when projection is involved.

  3. Not using projection at the Avro level at all.

    DataFusion could simply read all fields from OCF into Arrow and then project at the Arrow layer (also sketched after this list). While simple, this is undesirable for:

    • Wide OCF schemas where only a small subset of columns is needed.
    • Workloads where I/O and decode costs are a bottleneck; projection pushdown is a key optimization that other Arrow readers (e.g. CSV, Parquet) already expose via ReaderBuilder.
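
For comparison, alternative 2 would look roughly like the sketch below in DataFusion. This is an assumption-laden illustration: it presumes the metadata constants and the AvroSchema::try_from conversion referenced above are exported from arrow_avro::schema under these names, and that the caller has separately tracked the writer's root name and namespace.

    use std::collections::HashMap;

    use arrow_avro::schema::{
        AvroSchema, AVRO_NAME_METADATA_KEY, AVRO_NAMESPACE_METADATA_KEY,
    };
    use arrow_schema::{ArrowError, Schema};

    // Sketch of alternative 2: re-inject the writer's root record name and
    // namespace into the projected Arrow schema's metadata, then convert it
    // back into an Avro reader schema. Exact paths and signatures assumed.
    fn projected_reader_schema(
        projected: &Schema,
        root_name: &str,
        root_namespace: &str,
    ) -> Result<AvroSchema, ArrowError> {
        let mut metadata: HashMap<String, String> = projected.metadata().clone();
        metadata.insert(AVRO_NAME_METADATA_KEY.to_string(), root_name.to_string());
        metadata.insert(AVRO_NAMESPACE_METADATA_KEY.to_string(), root_namespace.to_string());
        let patched = Schema::new_with_metadata(projected.fields().clone(), metadata);
        AvroSchema::try_from(&patched)
    }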
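
Alternative 3 is simpler but pays the full decode cost first. RecordBatch::project does exist in arrow-array today, so the fallback is one call per batch, made only after every column has already been read and decoded:

    use arrow_array::RecordBatch;
    use arrow_schema::ArrowError;

    // Alternative 3: decode all columns, then narrow each batch afterwards.
    // Correct, but the I/O and decode work for the dropped columns is wasted,
    // which is exactly what builder-level projection pushdown avoids.
    fn narrow(batch: &RecordBatch, indices: &[usize]) -> Result<RecordBatch, ArrowError> {
        batch.project(indices)
    }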

Overall, these alternatives either duplicate arrow-avro logic downstream or fail to provide the performance characteristics that DataFusion and similar engines need. While one of these solutions may be acceptable in the short term (until the next arrow-rs major release), we want to ensure arrow-avro provides a long-term, optimized path forward.

Additional context

  • This feature is directly motivated by the ongoing effort to switch DataFusion’s Avro datasource over to the upstream arrow-avro reader. That PR currently has to juggle preserving the original Avro root record name with applying projection, and the discussion there explicitly calls out the need for a ReaderBuilder::with_projection-style API on arrow-avro.
  • The broader effort to adopt arrow-avro in DataFusion is tracked under "Use arrow-avro for performance and improved type support" (apache/datafusion#14097).
  • Other Arrow Rust readers such as arrow_csv::reader::ReaderBuilder already expose a with_projection(Vec<usize>) method, so adding an analogous method to arrow_avro::reader::ReaderBuilder would make the API more consistent across formats and easier to adopt in query engines that already rely on projection pushdown for CSV and Parquet.
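
For reference, the existing arrow-csv precedent looks like the sketch below (minor details such as the exact build bounds may vary by arrow-rs version):

    use std::fs::File;
    use std::sync::Arc;

    use arrow_csv::reader::ReaderBuilder;
    use arrow_schema::{DataType, Field, Schema};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let schema = Arc::new(Schema::new(vec![
            Field::new("a", DataType::Int64, true),
            Field::new("b", DataType::Utf8, true),
            Field::new("c", DataType::Float64, true),
        ]));

        // Existing arrow-csv API: keep only columns `a` and `c` by index.
        let reader = ReaderBuilder::new(schema)
            .with_projection(vec![0, 2])
            .build(File::open("data.csv")?)?;

        for batch in reader {
            println!("{} columns per batch", batch?.num_columns());
        }
        Ok(())
    }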
