Add tests for `std::[unordered_][multi]set` #39

enirolf · 2025-02-28T14:21:53Z

Closes #14.

hahnjo

Thanks for the PR! I left some comments; mostly, we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for the other containers and putting all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward, because there we actually want to test that the duplicates are preserved, and we may also want to be 100% certain that we get the index columns right. For the non-multi containers, though, duplicates are essentially handled on the C++ side before they reach RNTuple, I think, so I'm not sure we need those entries here...

hahnjo · 2025-03-04T08:07:47Z

types/set/fundamental/write.C

+  // Fourth entry: duplicate elements in the set
+  *Index32 = {1, 1};
+  *Index64 = {2, 2};
+  *SplitIndex32 = {3, 3};
+  *SplitIndex64 = {4, 4};
+  writer->Fill();


Hm, this entry is the same from the RNTuple point of view because the deduplication will happen on the C++ side, before we enter Fill(). Does it provide extra coverage?
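
To illustrate the deduplication point, here is a minimal standalone sketch. It is not taken from the PR and both helper names are hypothetical. It only shows that a `std::set` assigned `{1, 1}` already holds a single element before anything like `Fill()` could see it, whereas a `std::multiset` keeps both.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper (not part of the PR): size of a std::set after it is
// initialized from the "duplicate" list {1, 1}.
std::size_t SetSizeAfterDuplicates() {
  std::set<std::int32_t> s = {1, 1}; // the second 1 is dropped on insertion
  return s.size();
}

// Hypothetical helper: same initializer list for a std::multiset, which
// keeps equal elements instead of dropping them.
std::size_t MultisetSizeAfterDuplicates() {
  std::multiset<std::int32_t> s = {1, 1};
  return s.size();
}
```

Under this reading, the duplicate entry is only distinguishable from a one-element entry for the multi containers.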

hahnjo · 2025-03-04T08:08:25Z

types/set/fundamental/write.C

+  *Index32 = {2, 1};
+  *Index64 = {4, 3};
+  *SplitIndex32 = {6, 5};
+  *SplitIndex64 = {8, 7};
+  writer->Fill();


This also seems like it's mostly about the C++ semantics of std::set, not really the binary format...
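
A small sketch of why the "reverse" entry says more about `std::set` than about the binary format: with the default `std::less` comparison, the set iterates in ascending order regardless of the order in the initializer list. `IterationOrder` is a hypothetical helper introduced only for this illustration.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: copy the elements of a std::set into a vector in the
// order in which the set iterates over them.
std::vector<std::int32_t> IterationOrder(const std::set<std::int32_t> &s) {
  return std::vector<std::int32_t>(s.begin(), s.end());
}
```

Calling `IterationOrder({2, 1})` and `IterationOrder({1, 2})` yields the same sequence, so the two inputs are indistinguishable by the time the container is handed to a writer.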

types/set/nested/LinkDef.h

types/set/nested/write.C

types/README.md

types/multiset/fundamental/write.C

hahnjo · 2025-03-04T08:17:46Z

types/unordered_set/fundamental/write.C

+  *Index32 = {2, 1};
+  *Index64 = {4, 3};
+  *SplitIndex32 = {6, 5};
+  *SplitIndex64 = {8, 7};
+  writer->Fill();


Do we make guarantees about the element order on disk? If not, I'm not sure we need to test C++ semantics in the validation suite...

types/set/nested/README.md

hahnjo · 2025-03-04T08:28:43Z

types/set/fundamental/read.C

+  Set &value = *entry.GetPtr<Set>(name);
+  os << "    \"" << name << "\": [";
+  bool first = true;
+  for (auto element : value) {


Actually, is the iteration order defined here? Do we want an explicit construction of a std::vector (?) and sort it? If not, the output file might change from execution to execution...

The iteration order is defined by the (optional) comparison predicate in the set template (std::less by default), so unless this is different from the one used when writing, the order will be the same (see also https://stackoverflow.com/a/8834041)

Ah, I commented on the wrong test: the question is mostly relevant for std::unordered_[multi]set, but for symmetry we may want to do it for all tests.
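
One possible shape for the "construct a std::vector and sort it" idea, as a hedged sketch rather than the code that ended up in `read.C`: `SortedElements` and `ToJsonArray` are hypothetical names, and the sketch assumes the element type is an integer type that `std::to_string` accepts and that `operator<` orders.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: copy any container into a vector and sort it, so the
// result no longer depends on the container's iteration order.
template <typename Container>
std::vector<typename Container::value_type> SortedElements(const Container &c) {
  std::vector<typename Container::value_type> v(c.begin(), c.end());
  std::sort(v.begin(), v.end());
  return v;
}

// Hypothetical helper: format the sorted elements as a JSON-like array,
// e.g. "[1, 2, 3]". Assumes integer elements.
template <typename Container>
std::string ToJsonArray(const Container &c) {
  std::string out = "[";
  bool first = true;
  for (const auto &element : SortedElements(c)) {
    if (!first)
      out += ", ";
    first = false;
    out += std::to_string(element);
  }
  out += "]";
  return out;
}
```

The same helper works unchanged for `std::set` and `std::multiset`, which would keep the four read macros symmetric.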

types/set/nested/write.C

enirolf · 2025-03-04T11:25:59Z

Thanks for the PR! I left some comments; mostly, we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for the other containers and putting all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward, because there we actually want to test that the duplicates are preserved, and we may also want to be 100% certain that we get the index columns right. For the non-multi containers, though, duplicates are essentially handled on the C++ side before they reach RNTuple, I think, so I'm not sure we need those entries here...

Yeah, that's fair enough! I suppose we could squash [unordered_]set and [unordered_]multiset. One thing to consider is that it's not a given that every writer/reader will be written in C++ (we already know this isn't the case), but the spec explicitly refers to C++ types. But you're right that the ordering and duplicate handling are still not something the format itself is responsible for.

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope, we could even argue that these tests are somewhat redundant (except maybe for type handling?).

hahnjo · 2025-03-04T14:57:15Z

I suppose we could squash [unordered_]set and [unordered_]multiset.

That might be a good compromise because we probably want the same input entries / entry classes for multiset and unordered_multiset...

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope, we could even argue that these tests are somewhat redundant (except maybe for type handling?).

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite. But indeed, the question is how different it is from a binary format perspective. There are two axes to that: index column encoding and nesting (e.g. std::set<std::set<std::int32_t>>).

pcanal · 2025-03-04T17:43:15Z

But indeed, the question is how different it is from a binary format perspective.

As a side note, in roottest we do have a test that reads each STL collection from file into every other STL collection type with the same content (for a large subset of cases).

hahnjo · 2025-03-05T07:54:44Z

After thinking this over, here are some more considerations:

I think we want the nested tests of containers, covering both that values can be non-fundamental types (non-simple fields in ROOT) and that the containers work when nested under another container. It's not really much different from a binary format point of view, but I think the suite should also (try to) validate the write and read implementation, and there I see plenty of ways in which the (de)serialization can get things wrong (for example, just taking the entry number to get the size of a collection). If we can do the two things with a single test (e.g. std::set<std::set<std::int32_t>>), all the better.
We may want at least kIndex64 and kSplitIndex64 because those are the two default encodings (depending on whether compression is on or off), so they are what will be used most often. kIndex32 and kSplitIndex32 are kind of niche. Again, this is not really an argument about the binary format but more about the implementation.
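
To make the "entry number as collection size" pitfall concrete, here is a sketch under a simplified model. It assumes the index column stores, for each entry, the cumulative end offset of that entry's collection within one cluster. The function name and the flat `std::vector` standing in for the column are hypothetical; this is not ROOT code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified model (assumption): offsets[i] is the cumulative number of
// elements up to and including entry i. The size of entry i's collection is
// therefore the difference to the previous offset, not the entry number and
// not the raw offset value.
std::uint64_t CollectionSize(const std::vector<std::uint64_t> &offsets,
                             std::size_t entry) {
  assert(entry < offsets.size());
  const std::uint64_t begin = (entry == 0) ? 0 : offsets[entry - 1];
  return offsets[entry] - begin;
}
```

With offsets `{2, 2, 5}` the three entries have sizes 2, 0 and 3. An implementation that confuses the entry number or the raw offset with the size only gets caught by entries of differing sizes, which is one argument for keeping varied and nested test inputs.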

The decision whether to merge [unordered_]set and [unordered_]multiset I leave to you. In the end, given that we already have the tests written out, we may as well just keep them and make it easier to find the tests for a particular container.

enirolf · 2025-03-17T10:30:52Z

Based on this statement:

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite.

and because right now I don't have a clear idea of how to nicely merge the ordered and unordered variants (two field types instead of one, i.e., [Unordered][Split]Index{32,64}? That would mix the field type with the column representation in a way, though, which I don't like), for now I would opt to keep the four variants separate. I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

hahnjo · 2025-03-20T16:31:07Z

Thanks for also updating the infrastructure and the CI; it seems to pass 😃

for now I would opt to keep the four variants separate

Fine with me.

I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

I would tend to remove the "duplicate" and "reverse" tests because at least I personally find it confusing that the output doesn't match the input, and it's actually not RNTuple that is responsible for that. Let's hear some opinions from others maybe?

enirolf added the types label Feb 28, 2025

enirolf requested review from hahnjo, pcanal and silverweed February 28, 2025 14:21

enirolf self-assigned this Feb 28, 2025

enirolf mentioned this pull request Feb 28, 2025

Fix Experimental-related deprecation warnings #40

Closed

enirolf force-pushed the types-sets branch from 570132f to 8ebb4a1 Compare February 28, 2025 16:14

hahnjo requested changes Mar 4, 2025

hahnjo reviewed Mar 4, 2025

enirolf added 5 commits March 17, 2025 10:17

Add test for std::set

35fa098

Add test for std::unordered_set

c875826

Add test for std::multiset

c9f0a5b

Add test for std::unordered_multiset

7c5cb3e

Automatically create and load dictionaries

ae167fa

...for the tests that need them (currently, all nested `std::set` and friends).

enirolf force-pushed the types-sets branch from 8ebb4a1 to ae167fa Compare March 17, 2025 10:25

enirolf requested a review from hahnjo March 17, 2025 10:35
