Skip to content

Add tests for std::[unordered_][multi]set #39

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Feb 28, 2025

Closes #14.

Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I left some comments, mostly we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for other containers and put all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward because there we actually want to test that the duplicates are preserved, and also we may want to be 100% certain that we get the index columns right. For the non-multi containers though, duplicates are essentially handled on the C++ side, before it comes to RNTuple I think, so not sure if we need those entries here...

Comment on lines +68 to +73
// Fourth entry: duplicate elements in the set
*Index32 = {1, 1};
*Index64 = {2, 2};
*SplitIndex32 = {3, 3};
*SplitIndex64 = {4, 4};
writer->Fill();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this entry is the same from the RNTuple point of view because the deduplication will happen on the C++ side, before we enter Fill(). Does it provide extra coverage?

Comment on lines +76 to +80
*Index32 = {2, 1};
*Index64 = {4, 3};
*SplitIndex32 = {6, 5};
*SplitIndex64 = {8, 7};
writer->Fill();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems like it's mostly about the C++ semantics of std::set, not really the binary format...

Comment on lines +78 to +82
*Index32 = {2, 1};
*Index64 = {4, 3};
*SplitIndex32 = {6, 5};
*SplitIndex64 = {8, 7};
writer->Fill();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we make guarantees about the element order on disk? If not, I'm not sure we need to test C++ semantics in the validation suite...

Set &value = *entry.GetPtr<Set>(name);
os << " \"" << name << "\": [";
bool first = true;
for (auto element : value) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, is the iteration order defined here? Do we want an explicit construction of a std::vector (?) and sort it? If not, the output file might change from execution to execution...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iteration order is defined by the (optional) comparison predicate in the set template (std::less by default), so unless this is different from the one used when writing, the order will be the same (see also https://stackoverflow.com/a/8834041)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I commented on the wrong test: the question is mostly relevant for std::unordered_[multi]set, but for symmetry we may want to do it for all tests.

@enirolf
Copy link
Contributor Author

enirolf commented Mar 4, 2025

Thanks for the PR! I left some comments, mostly we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for other containers and put all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward because there we actually want to test that the duplicates are preserved, and also we may want to be 100% certain that we get the index columns right. For the non-multi containers though, duplicates are essentially handled on the C++ side, before it comes to RNTuple I think, so not sure if we need those entries here...

Yeah that's fair enough! I suppose we could squash [unordered_]set and [unordered_]multiset. One thing to consider is that it's not a given that every writer/reader will be written in C++ (we already know this isn't the case), but the spec explicitly refers to C++ types. But you're right that still the ordering and duplicate handling is not something that the format itself is responsible for.

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope we could even argue that these tests are somewhat redundant (except for maybe type handling?)..

@hahnjo
Copy link
Member

hahnjo commented Mar 4, 2025

I suppose we could squash [unordered_]set and [unordered_]multiset.

That might be a good compromise because we probably want the same input entries / entry classes for multiset and unordered_multiset...

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope we could even argue that these tests are somewhat redundant (except for maybe type handling?)..

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite. But indeed, the question is how much different is it from a binary format perspective. There's two axes to that: index column encoding and nesting (e.g. std::set<std::set<std::int32_t>>).

@pcanal
Copy link
Member

pcanal commented Mar 4, 2025

But indeed, the question is how much different is it from a binary format perspective.

As a side note, in roottest we do have a test of reading each STL collections in file format into all other STL collections of the same content (for a large subset of cases)

@hahnjo
Copy link
Member

hahnjo commented Mar 5, 2025

After thinking this over, here are some more considerations:

  1. I think we want the nested tests of containers, both that values can be non-fundamental types (non-simple fields in ROOT) and that they work when below another container. It's not really much different from a binary format point of view, but I think the suite should also (try to) validate the write and read implementation and there I see plenty of ways how the (de)serialization can get things wrong (for example, just taking the entry number to get the size of a collection). If we can do the two things with a single test (e.g. std::set<std::set<std::int32_t>>) the better.
  2. We may want at least kIndex64 and kSplitIndex64 because those are the two default encodings (depending on compression on/off), so they are what will be used most often. kIndex32 and kSplitIndex32 are kind of niche. Again not really an argument of the binary format but more of the implementation.

The decision whether to merge [unordered_]set and [unordered_]multiset I leave to you. In the end, given that we already have the tests written out, we may as well just keep them and make it easier to find the tests for a particular container.

@enirolf
Copy link
Contributor Author

enirolf commented Mar 17, 2025

Based on this statement:

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite.

and because I right now don't have a clear idea how to nicely merge the ordered and unordered variants (two field types instead of one, i.e., [Unordered][Split]Index{32,64}? that would mix the field type with the column representation in a way though, which I don't like), for now I would opt to keep the four variants separate. I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

@enirolf enirolf requested a review from hahnjo March 17, 2025 10:35
@hahnjo
Copy link
Member

hahnjo commented Mar 20, 2025

Thanks for the work of also updating the infrastructure and the CI, seems to pass 😃

for now I would opt to keep the four variants separate

Fine with me.

I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

I would tend to remove the "duplicate" and "reverse" tests because at least I personally find it confusing that the output doesn't match the input, and it's actually not RNTuple that is responsible for that. Let's hear some opinions from others maybe?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test(s) for std::[unordered_][multi]set
3 participants