[Feature] Adopt arrow serializer in VectorSerde #73

@zhangxffff

Description

Feature Category

Performance Optimization

Problem / Use Case

Currently, Bolt uses PrestoSerde to serialize RowVector when spilling, but PrestoSerde appears to have a performance issue. For example, in one workload, the row-based serializer spills more data than PrestoSerde (407G vs. 301G) but spends much less time on serialization (6.28h vs. 11.30h) and on compression + flush (4.19h vs. 5.6h). For spilling, the bottleneck therefore seems to be not I/O (25min vs. 8min) but serialization.

  • Basic structure of a serializer: append * n -> flush
  • Why the Presto serializer is slow:
    • One copy into StreamArena on append, plus the flattening cost (if needed)
    • Another copy into the OutputStream on flush
  • Bolt currently (Sep. 25, 2025) has 3 serializers (in bolt/serializers):
    • Row based: *RowSerializer
    • Column based: only PrestoSerializer
    • Newly added: ArrowSerializer
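
The append * n -> flush structure and the two copies described above can be illustrated with a toy model. This is only a sketch in Python, not Bolt code: `TwoCopySerializer`, `SingleCopySerializer`, and `run` are hypothetical names, and the single-copy variant is an assumption about what a cheaper path could look like, not a description of any existing Bolt serializer.

```python
import io


class TwoCopySerializer:
    """Toy model: append() copies into a staging buffer (copy 1),
    flush() copies the staging buffer into the output stream (copy 2)."""

    def __init__(self):
        self._staging = bytearray()
        self.bytes_copied = 0

    def append(self, payload: bytes) -> None:
        self._staging += payload              # copy 1: into the staging buffer
        self.bytes_copied += len(payload)

    def flush(self, out: io.BytesIO) -> None:
        out.write(self._staging)              # copy 2: into the output stream
        self.bytes_copied += len(self._staging)
        self._staging.clear()


class SingleCopySerializer:
    """Hypothetical alternative: append() only keeps a reference,
    so the data is copied once, at flush time."""

    def __init__(self):
        self._pending = []
        self.bytes_copied = 0

    def append(self, payload: bytes) -> None:
        self._pending.append(memoryview(payload))  # no copy, just a view

    def flush(self, out: io.BytesIO) -> None:
        for view in self._pending:
            out.write(view)                   # the only copy
            self.bytes_copied += len(view)
        self._pending.clear()


def run(serializer, batches):
    """Drive the append * n -> flush pattern and return the flushed bytes."""
    out = io.BytesIO()
    for batch in batches:
        serializer.append(batch)
    serializer.flush(out)
    return out.getvalue()


batches = [bytes([i]) * 1024 for i in range(4)]   # 4 batches of 1 KiB each
two, one = TwoCopySerializer(), SingleCopySerializer()
two_out, one_out = run(two, batches), run(one, batches)
print(two.bytes_copied, one.bytes_copied)
```

Both variants produce the same output bytes; the toy model only counts how many bytes each one moves, which is the cost the bullets above attribute to the Presto serializer.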

Proposed Solution

  • Use Arrow to make the serializer faster
    • Bolt RowVector is Apache Arrow compatible.
    • Basic idea: Bolt RowVector <-> Arrow <-> Arrow IPC

References / Prior Art

No response

Importance

Medium (Nice to have)

Willingness to Contribute

Yes, I can submit a PR

Metadata

Labels

enhancement (New feature or request)
