-
Notifications
You must be signed in to change notification settings - Fork 34
Open
Labels
enhancementNew feature or requestNew feature or request
Description
Feature Category
Performance Optimization
Problem / Use Case
Currently, Bolt using PrestoSerde to serialize RowVector when spill, but it seems PrestoSerde has performace issue. For example, in one workload, Row based spills more data (407G vs. 301G), but uses much less time for serialization (6.28h vs. 11.30h) and compression + flush (4.19h vs. 5.6h). It seems for spilling, the bottleneck is not on I/O (25min vs. 8min) but the serialization.
- Basic structure for serializer: append * n -> flush
- Reason why Presto serializer is slow:
- Copy once to StreamArena when append with flatten cost (if need)
- Another copy to Outputstream when flush happened.
- Current Bolt (SEP.25.2025) has 3 serializers (in bolt/serializers)
- Row based: *RowSerializer
- Column based: Only PrestoSerializer
- Newly added: ArrowSerializer
Proposed Solution
- Using Arrow to make serializer faster
- Bolt RowVector is Apache Arrow compatible.
- Basic Idea: Bolt RowVector <-> Arrow <-> ArrowIPC
References / Prior Art
No response
Importance
Medium (Nice to have)
Willingness to Contribute
Yes, I can submit a PR
Metadata
Metadata
Assignees
Labels
enhancementNew feature or requestNew feature or request