Conversation

@LiaCastaneda
Contributor

Which issue does this PR close?

Closes #
Related to https://github.com/apache/datafusion/issues/16841#issuecomment-3563643947

Rationale for this change

Aggregation accumulators that store Arrow arrays can over-account memory when array buffers are shared between multiple ScalarValues or Arrow arrays directly. This happens because ScalarValue::try_from_array() creates slices that reference the same underlying buffers, and each ScalarValue reports the full buffer size when calculating memory usage, so the same physical buffer gets counted multiple times, leading to over-accounting (this comment explains very well why we are seeing this).
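
A minimal sketch of the failure mode (standard arrow-rs and DataFusion APIs; the values are illustrative):

    use std::sync::Arc;
    use arrow::array::{ArrayRef, ListArray};
    use arrow::datatypes::Int64Type;
    use datafusion_common::ScalarValue;

    fn main() {
        // Two rows of a ListArray share one child values buffer.
        let list: ArrayRef = Arc::new(ListArray::from_iter_primitive::<Int64Type, _, _>(vec![
            Some(vec![Some(1), Some(2), Some(3)]),
            Some(vec![Some(4), Some(5)]),
        ]));

        // Each ScalarValue is a 1-row slice that still references the shared buffers.
        let row0 = ScalarValue::try_from_array(&list, 0).unwrap();
        let row1 = ScalarValue::try_from_array(&list, 1).unwrap();

        // Both report the full physical buffers, so summing them counts the
        // same allocation twice; this is the over-accounting described above.
        let over_accounted = row0.size() + row1.size();
        println!("{over_accounted}");
    }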

There have been several attempts to fix this before, which included compacting data so as not to keep the whole array alive, or using get_slice_memory_size instead of get_array_memory_size. However, we observed that both had downsides:

  • compact() is CPU-inefficient (it copies data, which can be expensive)
  • get_slice_memory_size() only accounts for logical memory, not the actual physical buffer capacity, so the amount returned is not accurate (see the sketch after this list)
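
A sketch of that difference (standard arrow-rs APIs):

    use std::sync::Arc;
    use arrow::array::{Array, ArrayRef, Int64Array};

    fn main() {
        let array: ArrayRef = Arc::new(Int64Array::from_iter_values(0..1_000));
        let slice = array.slice(0, 1);

        // Logical size: only the bytes this 1-row slice actually needs.
        let logical = slice.to_data().get_slice_memory_size().unwrap();
        // Physical size: every reachable buffer, shared capacity included.
        let physical = slice.get_array_memory_size();

        // The slice logically needs ~8 bytes but keeps the whole buffer
        // alive, so reporting `logical` under-states what is really held.
        assert!(logical < physical);
    }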

What changes are included in this PR?

This approach avoids double-counting memory by using Arrow's TrackingMemoryPool, which automatically deduplicates shared buffers when accounting them. This means we don't need to compact() or call get_slice_memory_size() just to solve the accounting problem. Note that compact() might still be useful when we want to release memory pressure.

  • Updated Accumulator::size() and GroupsAccumulator::size() signatures to accept Option<&dyn MemoryPool>:
    • When the pool is None, returns the total memory size including Arrow buffers, using either get_slice_memory_size or just ScalarValue::size() (same as before, so it's backward compatible)
    • When the pool is Some, returns the structural size only and claims buffers with the pool for deduplication tracking. Callers using the pool must add pool.used() to get the total memory (see the sketch after this list)
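
A sketch of the resulting calling convention (the import paths are my assumption; the pool and size() names are as described above):

    use arrow_buffer::pool::TrackingMemoryPool;
    use datafusion_expr::Accumulator;

    // One pool per stream; `acc` is any accumulator (construction elided).
    fn account(acc: &dyn Accumulator, pool: &TrackingMemoryPool) -> usize {
        // With a pool: structural size only; Arrow buffers are claimed by
        // the pool, which deduplicates shared buffers internally.
        let structural = acc.size(Some(pool));
        // The deduplicated buffer memory is then added exactly once.
        structural + pool.used()
    }

Passing None instead keeps the pre-PR behavior.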

Updated accumulators that use the pool parameter:

  • DistinctCountAccumulator
  • ArrayAggAccumulator
  • For OrderSensitiveArrayAggAccumulator and DistinctArrayAggAccumulator I removed the compacting, since it was introduced specifically to solve the over-accounting and is no longer needed.
  • FirstValueAccumulator / LastValueAccumulator
All other accumulator implementations were updated to match the new signature.

Are these changes tested?

Added a distinct_count_does_not_over_account_memory() test to verify memory pool deduplication for COUNT(DISTINCT) with array types. Also updated the existing accumulator tests to use the memory pool; they verify that the accounted memory is still less than when not using the memory pool (in some cases even less than when we compacted).

Are there any user-facing changes?

Yes, the size API for Accumulator and GroupsAccumulator changed from fn size(&self) -> usize; to fn size(&self, pool: Option<&dyn MemoryPool>) -> usize;

I'm not sure if this is the best API design... I'm open to suggestions. In any case, if None is passed, the behavior remains the same as before. Also, IIUC, this function is mainly used to keep the DataFusion memory pool within its bounds during aggregations.

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate substrait Changes to the substrait crate common Related to common crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Dec 26, 2025
@LiaCastaneda LiaCastaneda force-pushed the lia/use-arrow-pool-to-fix-memory-overaccounting-aggregations branch from 72d5f92 to 2378e6e on December 26, 2025 17:01
@LiaCastaneda LiaCastaneda force-pushed the lia/use-arrow-pool-to-fix-memory-overaccounting-aggregations branch from 2378e6e to 7d158c9 on December 26, 2025 17:07
@LiaCastaneda LiaCastaneda marked this pull request as ready for review December 30, 2025 10:01
@LiaCastaneda
Contributor Author

This is probably not the perfect solution, but maybe a starting point? We've seen over-accounting problems in TopK as well, so maybe the pool could also be integrated there? I'm open to suggestions :)

@gabotechs
Contributor

Nice! I'll review this one soon

@LiaCastaneda LiaCastaneda changed the title Use arrow pool to fix memory over accounting aggregations Use arrow pool to fix memory over accounting in aggregations Dec 30, 2025
Comment on lines +441 to 444
fn size(&self, _pool: Option<&dyn MemoryPool>) -> usize {
    size_of_val(self) - size_of_val(&self.max) + self.max.size()
}
}
Contributor Author


MinAccumulator and MaxAccumulator could also use the pool, since they hold a ScalarValue, which could hold an Arrow array with shared buffers. I didn't do it here to avoid adding more changes to this PR.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 30, 2025
@LiaCastaneda LiaCastaneda force-pushed the lia/use-arrow-pool-to-fix-memory-overaccounting-aggregations branch from 1b5bcc0 to cb09c94 on December 30, 2025 13:28
Contributor

@gabotechs gabotechs left a comment


Glad to see progress on the memory accounting issues! This definitely improves the situation.

Before moving forward with it, I think it might be worth clarifying the long term direction of memory tracking in arrow-rs/datafusion. Here are a couple of thoughts about the current state of things:

  • There are two different MemoryPool traits with overlapping intentions: the one from DataFusion and the one from arrow-rs. My impression is that this is an undesirable state, and we might want to consolidate into one.
  • The current memory reservation mechanism is passive, meaning that it's up to developers to manually register/deregister whatever memory was used in the appropriate MemoryPool, rather than this happening automatically during the actual allocation. That opens the door to mistakes that lead to under/over-accounting.
  • The overall assumption that aggregation accumulators, execution plans, etc. have a "size" might be flawed. Ultimately, what occupies space in memory is the shared buffers in the underlying data, not the accumulators or the execution plans. In arrow-rs all buffers are shared, so no single struct can claim ownership of them or count them as part of its "size".

In the C++ Arrow MemoryPool implementation, it's the MemoryPool itself that performs the allocation, so having a MemoryPool there is not optional. Making the MemoryPool responsible for allocations, and for failing if a new allocation would overflow the maximum capacity, looks to me like a superior approach that leaves very little room for error.
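
To illustrate the shape I mean (purely illustrative Rust; none of these types exist in arrow-rs today):

    // An allocating pool in the spirit of C++ Arrow's MemoryPool: every
    // allocation goes through it, so accounting cannot be forgotten.
    #[derive(Debug)]
    struct AllocError;

    trait AllocatingPool {
        // Allocate `size` bytes, failing if the pool's limit would be exceeded.
        fn allocate(&self, size: usize) -> Result<*mut u8, AllocError>;
        // Return previously allocated memory to the pool.
        fn free(&self, ptr: *mut u8, size: usize);
        // Bytes currently handed out by this pool.
        fn bytes_allocated(&self) -> usize;
    }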

After reviewing C++ Arrow's memory management model, I get the impression that it is the most advanced and mature one, and none of the implementations in arrow-rs or DataFusion seem to mirror it. So I wonder whether, rather than building on top of the current pillars, we should instead be exploring ways of getting those pillars right from the beginning.


if let Some(pool) = pool {
    for arr in &self.values {
        claim_buffers_recursive(&arr.to_data(), pool);
    }
}
Contributor


I imagine that .size() here can be called an arbitrary number of times. What would happen with claim_buffers_recursive if it is called many times?

Contributor Author


It will call claim() multiple times on the same Buffer/Bytes, but that is safe because each call replaces the previous reservation (it does not add to it), so the pool will only track each buffer once regardless of how many times size() is called.
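
A hypothetical check of that property, reusing this PR's claim_buffers_recursive helper and arrow's TrackingMemoryPool:

    // Claiming the same buffers twice must not change the tracked total.
    fn assert_claim_is_idempotent(array: &dyn Array, pool: &TrackingMemoryPool) {
        claim_buffers_recursive(&array.to_data(), pool);
        let first = pool.used();
        // Re-claiming replaces the existing reservations instead of adding
        // to them, so repeated size() calls cannot inflate the pool.
        claim_buffers_recursive(&array.to_data(), pool);
        assert_eq!(pool.used(), first);
    }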

if let Some(array) = scalar.get_array_ref() {
    total += size_of::<Arc<dyn Array>>();
    if let Some(pool) = pool {
        claim_buffers_recursive(&array.to_data(), pool);
Contributor


I see that if a &dyn MemoryPool is passed, the array size does not count towards the total size; it is instead claimed in the &dyn MemoryPool.

Imagine this scenario:

  • The underlying array is huge
  • A dyn MemoryPool is passed, so the array size does not count towards total_size; it's just claimed in the Arrow buffer memory pool
  • In GroupedHashAggregateStream::update_memory_reservation, total_size is very small, as the array size did not count towards it
  • When calling reservation.try_resize() with the small total_size, the reservation succeeds

Isn't this a problematic scenario?

Contributor Author

@LiaCastaneda LiaCastaneda Dec 31, 2025


Yes, that's why after calling size() we add arrow_pool.used() to account for the buffer memory (I forgot to add it). I also considered calling arrow_pool.used() directly inside each accumulator's size() function so the caller doesn't have to remember to do it. However, that would still cause over-accounting in scenarios like update_memory_reservation(), where we sum size() across multiple accumulators that can share buffers as well:

        let total_size = self.group_values.size()
            + self.group_ordering.size()
            + self.current_group_indices.capacity() * size_of::<usize>()
            + self
                .accumulators
                .iter()
                .map(|x| x.size(Some(&self.arrow_pool)))
                .sum::<usize>(); // if each size() returned arrow_pool.used(), we would still be over-counting the pool

Contributor Author


So this is how we would use the pool properly (link) to calculate the total size without counting shared buffers multiple times:
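
Roughly (a sketch mirroring the snippet above, with the buffer memory added once at the end):

    // Each accumulator claims its buffers into the stream's single pool and
    // returns only its structural size...
    let structural: usize = self
        .accumulators
        .iter()
        .map(|acc| acc.size(Some(&self.arrow_pool)))
        .sum();

    // ...then the deduplicated buffer memory is counted exactly once.
    let total_size = structural + self.arrow_pool.used();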

Contributor


🤔 Ok, and that works because we have 1 TrackingArrowPool per GroupedHashAggregateStream 👍

Comment on lines 1596 to 1605
if let Some(array) = scalar.get_array_ref() {
    total += size_of::<Arc<dyn Array>>();
    if let Some(pool) = pool {
        claim_buffers_recursive(&array.to_data(), pool);
    } else {
        total += scalar.size() - size_of_val(scalar);
    }
} else {
    total += scalar.size() - size_of_val(scalar);
}
Contributor


This pattern seems to be repeated several times across the project. Maybe a helper could be useful?
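
For instance, something along these lines (a hypothetical shape, not code from this PR):

    // Hypothetical helper consolidating the repeated pattern above.
    fn scalar_heap_size(scalar: &ScalarValue, pool: Option<&dyn MemoryPool>) -> usize {
        match (scalar.get_array_ref(), pool) {
            // Buffers go to the pool (deduplicated); only the Arc shell
            // counts towards the returned structural size.
            (Some(array), Some(pool)) => {
                claim_buffers_recursive(&array.to_data(), pool);
                size_of::<Arc<dyn Array>>()
            }
            // Array-backed scalar but no pool: count buffers the old way.
            (Some(_), None) => {
                size_of::<Arc<dyn Array>>() + scalar.size() - size_of_val(scalar)
            }
            // No array inside: plain ScalarValue heap size.
            (None, _) => scalar.size() - size_of_val(scalar),
        }
    }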

Contributor Author


yep, good idea

@LiaCastaneda
Contributor Author

Yes, I completely agree with you. I created apache/arrow-rs#8938 to raise this issue on the Arrow side. In most other Arrow implementations, they carry around a context object that contains the memory pool, so newly created arrays are immediately accounted for in the pool. In that case, DataFusion wouldn't have to do anything other than use the Arrow memory pool instead of the DataFusion pool. However, I'm aware this would require a considerable amount of effort on the Arrow side, and I'm not sure what the community there thinks of this idea.
