added use of base, when configured, in dcgm sampling. #1623

baallan · 2025-03-03T21:40:54Z

before jobid data didn't get set if use_base was set.

tom95858 · 2025-03-05T02:51:57Z

ldms/scripts/examples/dcgm1.1

@@ -1,3 +1,3 @@
 load name=dcgm_sampler
-config name=dcgm_sampler producer=localhost${i} schema=${testname} instance=localhost${i}/${testname} component_id=${i} interval=1000000 perm=757 uid=3556 gid=3556 job_set=instance=localhost${i}/job_info use_base=1
+config name=dcgm_sampler producer=localhost${i} schema=${testname} instance=localhost${i}/${testname} component_id=${i} interval=1000000 perm=757 uid=3556 gid=3556 job_set=localhost${i}/jobid use_base=1


Sorry, but I am missing why the use of the base class is optional?

it was optional because @morrone coded it that way. In some previous PR, I added the use_base option (the implementation of which was evidently incomplete, since this patch fixes a bug). The original sampler_base implementation doesn't contemplate convenient multiple set semantics, but exposes the base struct so all good things are obtainable.

Yeah, it was the first thing I ever wrote for LDMS, back before I'd learned about all the unwritten rules for samplers. :)

Engaging in layering violations (fiddling with sampler_base's internal struct) is a pretty big kludge, but is fine for now. Ultimately, we really need to either give sampler_base support for multiple metric sets per schema, or we overhaul dcgm_sampler to use a single metric set with an array for the multiple devices.

It seems like one-metric-set-per-schema is more-or-less the new "LDMS Way", now that we have arrays and decomposition.

@baallan, @morrone, it is probably time to create a new "base class" with support for multiple sets configurable tenant data etc... @narategithub is working on that right now and I believe he has a WIP pull request where it can be reviewed.

wrt dcgm, ib switch, and other devices that are potentially considerably slower to query, or even timeout-y, I would prefer we keep the set-per-device design so that the duration time (the window of set inconsistency) is per-device instead of the larger time of all device collections summed. This helps distinguish problematic devices in ib, and possibly in gpus. If a device in a clump gets slow to answer, we may never get a set consistency window even though other devices are fine.

I agree there may be devices which make sense as record arrays because the data query looks at currently cached kernel data instead of poking hardware (e.g. I don't personally know if local disk arrays fall in this category or not).

BTW, this is the reason why the sampler threading was refactored for 4.5.1 to add the ability to assign these types of samplers to their own thread so that they don't impact the sampling latency of the other samplers.

@tom95858 I don't think that is what @baallan was talking about. The sampler dedicated-thread code, to the best of my understanding, addresses interval jitter (what I assume you are referring to by "latency"). It allows sample() for a given plugin instance/configuration to be called at the scheduled time a little more accurately. This is because it has its own thread, instead of sample() being called sequentially for multiple samplers that have the same scheduled sample() time. correct?

That wouldn't have any change on the time spent inside sample(), which is what Ben was talking about. Even more specifically he is concerned about a subset of the sample() time: the time spent modifying a particular metric set (the time it is marked "inconsistent").

Ben is essentially opposed (with certain exceptions) to having multiple devices represented in an array in a single metric set. He wants a metric set per device.

@morrone, I think we are on the same page, but perhaps fussing over the words. If a single thread has samplers A, B, and C and B is a pig, then C's sample callback will be delayed by B. So the point is only that the new capability allows you to put B in it's own thread and let it languish on it's own without impacting C. Now, if you insist on putting all interfaces in a single set with a list of records, then yes, this doesn't work. But that is the principally what we are trying to solve with multi-configuration plugins. We can have the slower/unpredictable interfaces in one set, and the more resilient/predictable ones in another.

@morrone, I think we are on the same page, but perhaps fussing over the words. If a single thread has samplers A, B, and C and B is a pig, then C's sample callback will be delayed by B. So the point is only that the new capability allows you to put B in it's own thread and let it languish on it's own without impacting C. Now, if you insist on putting all interfaces in a single set with a list of records, then yes, this doesn't work. But that is the principally what we are trying to solve with multi-configuration plugins. We can have the slower/unpredictable interfaces in one set, and the more resilient/predictable ones in another.

Multi-configuration can help Ben's situation, absolutely. It would do that by explicitly creating configurations that force the devices appear in separate metric sets. But you didn't say that, you only talked about private threading. Private threading is also nice addition, making the intervals more consistent. But that wasn't the issue raised. Maybe in your mind multi-configuration and private threads are one and the same thing, but they seem pretty distinct concepts to me.

But we digress quite a bit. :)

We do digress mightily. If its important, the highlight should be moved to some discussion thread.
@tom95858 @morrone multi-instance stuff would not be used to configure around a known bad device; we're going to configure 1 sampler for all devices of a class or 1 sampler per device (with a strong hatred of this, as enumerating devices in a system configuration file is a fine way to lose one's hair; the devices existing tend to vary per node type or time within a cluster).

is working on that right now and I believe he has a WIP pull request where it can be reviewed.

I don't see a PR for that.

baallan · 2025-03-27T15:00:46Z

@tom95858 I'm delighted to hear that @narategithub is bringing multiple independent sets per sampler to the new API. Let also not forget that in that scenario there are possibly multiple independent schema per sampler instance. The ibnet sampler is a concrete case of this.

morrone · 2025-03-27T15:54:47Z

@tom95858 I'm delighted to hear that @narategithub is bringing multiple independent sets per sampler to the new API. Let also not forget that in that scenario there are possibly multiple independent schema per sampler instance. The ibnet sampler is a concrete case of this.

@baallan Unless you are bringing up some other new feature that hasn't been mentioned here yet (and maybe I'm not aware of?):

Just to be clear, we can already do multiple metric sets per sampler. The new "multi-configuration" API permits multiple complete, independent configurations of a given plugin. There isn't, to my knowledge, anything to keep in mind specifically about schemas. It will largely be incumbent upon the user to make sure that, when they configure each configuration of the same plugin, to tell it to use either the same or different schema names, as appropriate. If you want to have the same schema in each configuration, then the schema names would be the same. If the configuration results in each configuration instance having a different actual schema, then obviously each would need to be configured with different schema names.

This is really pretty similar to configuring the same sampler plugin across different nodes. The user need to take care to give the same schemas the same name, and not use the same schema name for different schemas.

Generally speaking, the plugin API doesn't deal directly with schemas or metric sets (at least for samplers). There used to be a "get_set()" function (or something similar) but that was removed because it wasn't used anymore, and didn't really fit with modern sampler behavior.

baallan · 2025-03-27T17:02:21Z

@morrone, @tom95858 said:

'it is probably time to create a new "base class" with support for multiple sets configurable tenant data etc... @narategithub is working on that right now'

I took this to mean the sampler_base, or whatever the new thing is called, since the old one has to work through the transition period, allows the user to:

simply make multiple instances (eg. per device, there's an instance)
make records per device (as now)
make one instance produce multiple sets (one per device)

And i assert that if 3 is supported by the new api, instead of having to poke fingers into struct base_sampler, we should make sure that the same sampler can produce multiple sets of diverse schemas, not just a single schema, using the api. As an example, the ibnet sampler produces sets (10s-100s) with a port perf data schema and another set with a sample timing performance summary profile schema.
This will come particularly handy with the "default plugin" producing many sets for different metric groups.

But again, way off topic....

morrone · 2025-03-27T17:09:35Z

'it is probably time to create a new "base class" with support for multiple sets configurable tenant data etc... @narategithub is working on that right now'

Oh, sorry! I missed that.

morrone · 2025-03-27T17:16:30Z

But again, way off topic....

Right, right. Back on topic: this PR looks good to me. :)

added use of base, when configured, in sampling.

30d22de

before jobid data didn't get set if use_base was set.

baallan added this to the v4.4.6 milestone Mar 3, 2025

tom95858 reviewed Mar 5, 2025

View reviewed changes

morrone approved these changes Mar 27, 2025

View reviewed changes

baallan mentioned this pull request Apr 15, 2025

dcgm_sampler: reinitialize the dcgm library when needed #1646

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

added use of base, when configured, in dcgm sampling. #1623

added use of base, when configured, in dcgm sampling. #1623

Uh oh!

baallan commented Mar 3, 2025

Uh oh!

tom95858 Mar 5, 2025

Uh oh!

baallan Mar 5, 2025

Uh oh!

morrone Mar 26, 2025

Uh oh!

tom95858 Mar 26, 2025

Uh oh!

baallan Mar 26, 2025

Uh oh!

morrone Mar 26, 2025

Uh oh!

tom95858 Mar 26, 2025

Uh oh!

morrone Mar 26, 2025

Uh oh!

baallan Mar 27, 2025

Uh oh!

morrone Mar 27, 2025

Uh oh!

baallan commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

baallan commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

Uh oh!

added use of base, when configured, in dcgm sampling. #1623

Are you sure you want to change the base?

added use of base, when configured, in dcgm sampling. #1623

Uh oh!

Conversation

baallan commented Mar 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

baallan commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

baallan commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

morrone commented Mar 27, 2025

Uh oh!

Uh oh!