RNTuple __len__ should return the number of keys, and have a num_entries property for the number of entries #1220

Open
jpivarski opened this issue May 16, 2024 · 0 comments · May be fixed by #1250
Assignees
Labels
feature New feature or request

Comments

@jpivarski
Member

Although it may be counterintuitive, uproot.TTree and uproot.RNTuple should return the number of branches/top-level columns from __len__, rather than the number of entries. This is because they satisfy the Mapping protocol: __len__ is the length of the output of keys, values, and items.

For the number of entries, uproot.RNTuple should follow uproot.TTree in having a num_entries property. Users will need this information to know how to set entry_start and entry_stop.
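As a rough illustration of the distinction, here is a minimal sketch using a hypothetical `ToyNTuple` class (a stand-in, not uproot's actual implementation). It assumes columns are stored in a plain dict of equal-length lists. `__len__` counts keys, as the Mapping protocol requires, and `num_entries` counts rows.

```python
from collections.abc import Mapping


class ToyNTuple(Mapping):
    """Hypothetical stand-in for an RNTuple-like object (not uproot's class)."""

    def __init__(self, columns):
        # columns: dict mapping field name -> list of per-entry values
        self._columns = dict(columns)

    def __getitem__(self, key):
        return self._columns[key]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        # Mapping contract: len(self) == len(self.keys())
        return len(self._columns)

    @property
    def num_entries(self):
        # Number of rows; assumes every column has the same length.
        first = next(iter(self._columns.values()), [])
        return len(first)


toy = ToyNTuple({"pt": [10.0, 20.0, 30.0], "eta": [0.1, -0.2, 0.3]})
print(len(toy), len(toy.keys()), len(toy.items()))  # number of columns
print(toy.num_entries)  # number of rows
```

Deriving from `collections.abc.Mapping` gives `keys`, `values`, and `items` for free, and all of them agree with `__len__`, which is the consistency argued for above.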


Maybe uproot.RNTuple needs an iterate method as well. For TTree, this is functionally equivalent to

def iterate(self, *args, step_size=<default>, **kwargs):
    step_size = regularize_step_size(step_size)
    for start in range(0, self.num_entries, step_size):
        yield self.arrays(*args, entry_start=start, entry_stop=start + step_size, **kwargs)

but TTree does a little more: partially read TBaskets that were trimmed to yield one array are reused if they're also part of the next array. This takes advantage of the fact that iterate is sequential, so we know what's coming next. Here's where that happens:

previous_baskets = {}
for sub_entry_start in range(entry_start, entry_stop, entry_step):
    sub_entry_stop = min(sub_entry_start + entry_step, entry_stop)
    if sub_entry_stop - sub_entry_start == 0:
        continue
    ranges_or_baskets = []
    checked = set()
    for _, context in expression_context:
        for branch in context["branches"]:
            if branch.cache_key not in checked:
                checked.add(branch.cache_key)
                for (
                    basket_num,
                    range_or_basket,
                ) in branch.entries_to_ranges_or_baskets(
                    sub_entry_start, sub_entry_stop
                ):
                    previous_basket = previous_baskets.get(
                        (branch.cache_key, basket_num)
                    )
                    if previous_basket is None:
                        ranges_or_baskets.append(
                            (branch, basket_num, range_or_basket)
                        )
                    else:
                        ranges_or_baskets.append(
                            (branch, basket_num, previous_basket)
                        )
    arrays = {}
    interp_options = {"ak_add_doc": ak_add_doc}
    _ranges_or_baskets_to_arrays(
        self,
        ranges_or_baskets,
        branchid_interpretation,
        sub_entry_start,
        sub_entry_stop,
        decompression_executor,
        interpretation_executor,
        library,
        arrays,
        True,
        interp_options,
    )
    _fix_asgrouped(
        arrays,
        expression_context,
        branchid_interpretation,
        library,
        how,
        ak_add_doc,
    )
    output = language.compute_expressions(
        self,
        arrays,
        expression_context,
        keys,
        aliases,
        self.file.file_path,
        self.object_path,
    )
    # no longer needed; save memory
    del arrays
    minimized_expression_context = [
        (e, c)
        for e, c in expression_context
        if c["is_primary"] and not c["is_cut"]
    ]
    out = _ak_add_doc(
        library.group(output, minimized_expression_context, how),
        self,
        ak_add_doc,
    )
    next_baskets = {}
    for branch, basket_num, basket in ranges_or_baskets:
        basket_entry_start, basket_entry_stop = basket.entry_start_stop
        if basket_entry_stop > sub_entry_stop:
            next_baskets[branch.cache_key, basket_num] = basket
    previous_baskets = next_baskets
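Stripped of the interpretation and expression machinery, the carry-over idea in the excerpt above can be sketched as follows. This is a simplified illustration under stated assumptions, not uproot's code: `iter_chunks`, the `(basket_num, start, stop)` tuples, and the `read` callable are all hypothetical, and a single branch is assumed.

```python
def iter_chunks(baskets, entry_start, entry_stop, entry_step, read):
    """Yield (sub_start, sub_stop, basket_nums) for each chunk.

    baskets: list of (basket_num, start, stop) half-open entry ranges.
    read: hypothetical callable basket_num -> data (the expensive step).
    """
    previous = {}  # baskets carried over from the previous chunk
    for sub_start in range(entry_start, entry_stop, entry_step):
        sub_stop = min(sub_start + entry_step, entry_stop)
        used = []
        for num, b_start, b_stop in baskets:
            if b_start < sub_stop and b_stop > sub_start:  # overlaps the chunk
                # Reuse the basket if the previous chunk already read it.
                data = previous[num] if num in previous else read(num)
                used.append((num, b_stop, data))
        yield sub_start, sub_stop, [num for num, _, _ in used]
        # Keep only baskets that extend past this chunk; the next chunk needs them.
        previous = {num: data for num, b_stop, data in used if b_stop > sub_stop}


reads = []


def fake_read(num):
    reads.append(num)  # record every real read
    return f"basket-{num}"


baskets = [(0, 0, 4), (1, 4, 10), (2, 10, 12)]
chunks = list(iter_chunks(baskets, 0, 12, 3, fake_read))
print(chunks)
print(reads)  # each basket is read once, even when it spans several chunks
```

Without the `previous` dictionary, the same input would trigger six reads instead of three, because baskets 0 and 1 straddle chunk boundaries.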

I don't know if the RNTuple code is organized in such a way that it would be easy to do this. (Not that I would call the TTree implementation "easy": this feature is obfuscating the code quite a bit.) I'm sympathetic to the argument that this isn't worth doing, since parallel chunked-access methods are more important than sequential ones.
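For comparison, a parallel chunked-access pattern only needs the number of entries to compute chunk boundaries, which is one more reason to expose num_entries. The sketch below is an assumption-laden illustration: `chunk_ranges` and `read_range` are hypothetical names, and `read_range` stands in for a call such as `arrays(entry_start=..., entry_stop=...)` on a real object.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(num_entries, step_size):
    """Split [0, num_entries) into half-open (entry_start, entry_stop) pairs."""
    return [
        (start, min(start + step_size, num_entries))
        for start in range(0, num_entries, step_size)
    ]


def read_range(entry_range):
    # Hypothetical reader: a real one would call something like
    # obj.arrays(entry_start=start, entry_stop=stop).
    start, stop = entry_range
    return list(range(start, stop))


ranges = chunk_ranges(num_entries=10, step_size=4)
with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in input order even though
    # the chunks may be read concurrently.
    chunks = list(pool.map(read_range, ranges))
print(ranges)
print(chunks)
```

Because each chunk is read independently, there is no basket carry-over between chunks here, which is the trade-off against the sequential iterate described above.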

@jpivarski jpivarski added the feature New feature or request label May 16, 2024
@ariostas ariostas self-assigned this Jun 20, 2024
@ariostas ariostas linked a pull request Jul 15, 2024 that will close this issue