-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add heap_size to statistics #19599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add heap_size to statistics #19599
Conversation
| pub fn heap_size(&self) -> usize { | ||
| // column_statistics + num_rows + total_byte_size | ||
| self.column_statistics.capacity() * size_of::<ColumnStatistics>() | ||
| + size_of::<Precision<usize>>() * 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here Precision<usize> is an enum and does not have a heap allocated fields, so it is allocated in the stack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these things are usually Arc'ed - so everything should be moved to the heap, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So size_of::<Precision<usize>>() * 2 should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, if we want to follow the trait in arrow, which I think according to #19599 (comment) was the conclusion of the next step? Do you plan on push a commit to do so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes i am on it. pr coming up soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be able to get the heap size of arrays to implement it for Statistics? What's the chain of fields that takes us there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L376 as part of ScalarValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And these are part of ColumnStatistics https://github.com/apache/datafusion/blob/main/datafusion/common/src/scalar/mod.rs#L376
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bummer. Isn't there ways to get the size of an array in memory? E.g. Array::get_array_memory_size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may actually work. Thanks, I will try that.
| /// Returns the memory size in bytes. | ||
| pub fn heap_size(&self) -> usize { | ||
| // column_statistics + num_rows + total_byte_size | ||
| self.column_statistics.capacity() * size_of::<ColumnStatistics>() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO you need to iterate over the column_statistics and add their heap allocations
https://github.com/mkleen/datafusion/blob/41b875d19d9c3671e141cec11814afd91a06a0f1/datafusion/common/src/stats.rs#L732-L736 - there are Precision<ScalarValue> fields and the ScalarValue enum has variants which use String, Vec and Box
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Should we add heap_size() to ScalarValue as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. So would it make sense to add heap_size() to column_statistics ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we want to accurately calculate the sizes we need to propagate that method all the way down. I'm not sure at what point we should just be using https://crates.io/crates/deepsize or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sadly none looks all that popular (going off of stars) or battle tested. Hard choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deepsize2 is a fork of deepsize + applying all opened PRs and updating the (optional) dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense then as the choice if we’re going to use an external crate. It seems like the most ergonomic. This does seems like something that should be more of a community wide decision than isolated in this PR though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the next step then? File a new issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #19615. I think for this PR I would recommend adopting arrow-rs's HeapSize with manual implementations. If we add a macro there later we can update to use that.
This adds a heap_size method returning the amount of memory a statistics struct allocates on the heap.
Which issue does this PR close?
Relates to #19052 (comment)
Rationale for this change
This adds a
heap_sizemethod returning the amount of memory a statistics struct allocates on the heap.What changes are included in this PR?
NA.
Are these changes tested?
Yes
Are there any user-facing changes?
One new method on
Statistics.