-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve spill performance: Disable re-validation of spilled files #15454
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this is great.
read_spill/StreamReader/read_100/with_validation/
time: [3.7400 ms 3.7508 ms 3.7639 ms]
Found 9 outliers among 100 measurements (9.00%)
4 (4.00%) low mild
3 (3.00%) high mild
2 (2.00%) high severe
read_spill/StreamReader/read_100/skip_validation/
time: [2.4500 ms 2.4623 ms 2.4785 ms]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
However, I'm not familiar with the arrow upgrade process, here are some very basic questions:
Should we upgrade all arrow sub-crates in lockstep (all arrow-*
to 54.3.0) to avoid conflict? @alamb
This StreamReader
update should be inside arrow-ipc
crate, why only updating arrow
dependency in cargo.toml
make this update available?
criterion_group!(benches, bench_spill_read); | ||
criterion_main!(benches); | ||
|
||
fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest writing the benchmark by timing SpillManager
's IO speed for test batches, instead of several thin wrappers on arrow functions, in the future it might get hard to figure out what this micro-bench is for. This way we can also track the IO performance over time, and it's okay not able to compare the validation difference in this PR, we can apply some manual changes to verify the speed-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the review, I'll follow your suggestion in the pr :)
// SAFETY: DataFusion's spill writer strictly follows Arrow IPC specifications | ||
// with validated schemas and buffers. Skip redundant validation during read | ||
// to speedup read operation. This is a deliberate safety-performance tradeoff. | ||
let reader = unsafe { StreamReader::try_new(file, None)?.with_skip_validation(true) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼
@@ -87,7 +87,7 @@ ahash = { version = "0.8", default-features = false, features = [ | |||
"runtime-rng", | |||
] } | |||
apache-avro = { version = "0.17", default-features = false } | |||
arrow = { version = "54.2.1", features = [ | |||
arrow = { version = "54.3.0", features = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the other arrow dependencies need to be upgraded as well (this is automatically done by cargo already, but makes sense to update it in this file as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree -- we need to specify 54.3.0 as we are now using a feature from 54.3.0
and I think we should do all the other crates too as @Dandandan suggests
Thank you @zebsme |
Thank you! I think it's almost ready. The only remaining task is to upgrade the other Arrow dependencies to 54.3.0 similar to #14153 did. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zebsme and @2010YOUY01 and @Dandandan
I'll wait to merge this until @2010YOUY01 has had a chance to review
|
||
// BENCHMARK: REVALIDATION OVERHEAD COMPARISON | ||
// --------------------------------------------------------- | ||
// To compare performance with/without Arrow IPC validation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏼
Co-authored-by: Daniël Heres <[email protected]>
Merged up to get latest changes and rerun CI |
Merged up from main to try and get a clean CI run |
Which issue does this PR close?
mmap
the spill files #15321.Rationale for this change
Speed up spill read by enabling skip_validation
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No