-
Notifications
You must be signed in to change notification settings - Fork 56
added use of base, when configured, in dcgm sampling. #1623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: b4.4
Are you sure you want to change the base?
Conversation
before jobid data didn't get set if use_base was set.
@@ -1,3 +1,3 @@ | |||
load name=dcgm_sampler | |||
config name=dcgm_sampler producer=localhost${i} schema=${testname} instance=localhost${i}/${testname} component_id=${i} interval=1000000 perm=757 uid=3556 gid=3556 job_set=instance=localhost${i}/job_info use_base=1 | |||
config name=dcgm_sampler producer=localhost${i} schema=${testname} instance=localhost${i}/${testname} component_id=${i} interval=1000000 perm=757 uid=3556 gid=3556 job_set=localhost${i}/jobid use_base=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but I am missing why the use of the base class is optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was optional because @morrone coded it that way. In some previous PR, I added the use_base option (the implementation of which was evidently incomplete, since this patch fixes a bug). The original sampler_base implementation doesn't contemplate convenient multiple set semantics, but exposes the base struct so all good things are obtainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it was the first thing I ever wrote for LDMS, back before I'd learned about all the unwritten rules for samplers. :)
Engaging in layering violations (fiddling with sampler_base's internal struct) is a pretty big kludge, but is fine for now. Ultimately, we really need to either give sampler_base support for multiple metric sets per schema, or we overhaul dcgm_sampler to use a single metric set with an array for the multiple devices.
It seems like one-metric-set-per-schema is more-or-less the new "LDMS Way", now that we have arrays and decomposition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@baallan, @morrone, it is probably time to create a new "base class" with support for multiple sets configurable tenant data etc... @narategithub is working on that right now and I believe he has a WIP pull request where it can be reviewed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrt dcgm, ib switch, and other devices that are potentially considerably slower to query, or even timeout-y, I would prefer we keep the set-per-device design so that the duration time (the window of set inconsistency) is per-device instead of the larger time of all device collections summed. This helps distinguish problematic devices in ib, and possibly in gpus. If a device in a clump gets slow to answer, we may never get a set consistency window even though other devices are fine.
I agree there may be devices which make sense as record arrays because the data query looks at currently cached kernel data instead of poking hardware (e.g. I don't personally know if local disk arrays fall in this category or not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this is the reason why the sampler threading was refactored for 4.5.1 to add the ability to assign these types of samplers to their own thread so that they don't impact the sampling latency of the other samplers.
@tom95858 I don't think that is what @baallan was talking about. The sampler dedicated-thread code, to the best of my understanding, addresses interval jitter (what I assume you are referring to by "latency"). It allows sample() for a given plugin instance/configuration to be called at the scheduled time a little more accurately. This is because it has its own thread, instead of sample() being called sequentially for multiple samplers that have the same scheduled sample() time. correct?
That wouldn't have any change on the time spent inside sample(), which is what Ben was talking about. Even more specifically he is concerned about a subset of the sample() time: the time spent modifying a particular metric set (the time it is marked "inconsistent").
Ben is essentially opposed (with certain exceptions) to having multiple devices represented in an array in a single metric set. He wants a metric set per device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morrone, I think we are on the same page, but perhaps fussing over the words. If a single thread has samplers A, B, and C and B is a pig, then C's sample callback will be delayed by B. So the point is only that the new capability allows you to put B in it's own thread and let it languish on it's own without impacting C. Now, if you insist on putting all interfaces in a single set with a list of records, then yes, this doesn't work. But that is the principally what we are trying to solve with multi-configuration plugins. We can have the slower/unpredictable interfaces in one set, and the more resilient/predictable ones in another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@morrone, I think we are on the same page, but perhaps fussing over the words. If a single thread has samplers A, B, and C and B is a pig, then C's sample callback will be delayed by B. So the point is only that the new capability allows you to put B in it's own thread and let it languish on it's own without impacting C. Now, if you insist on putting all interfaces in a single set with a list of records, then yes, this doesn't work. But that is the principally what we are trying to solve with multi-configuration plugins. We can have the slower/unpredictable interfaces in one set, and the more resilient/predictable ones in another.
Multi-configuration can help Ben's situation, absolutely. It would do that by explicitly creating configurations that force the devices appear in separate metric sets. But you didn't say that, you only talked about private threading. Private threading is also nice addition, making the intervals more consistent. But that wasn't the issue raised. Maybe in your mind multi-configuration and private threads are one and the same thing, but they seem pretty distinct concepts to me.
But we digress quite a bit. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do digress mightily. If its important, the highlight should be moved to some discussion thread.
@tom95858 @morrone multi-instance stuff would not be used to configure around a known bad device; we're going to configure 1 sampler for all devices of a class or 1 sampler per device (with a strong hatred of this, as enumerating devices in a system configuration file is a fine way to lose one's hair; the devices existing tend to vary per node type or time within a cluster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is working on that right now and I believe he has a WIP pull request where it can be reviewed.
I don't see a PR for that.
@tom95858 I'm delighted to hear that @narategithub is bringing multiple independent sets per sampler to the new API. Let also not forget that in that scenario there are possibly multiple independent schema per sampler instance. The ibnet sampler is a concrete case of this. |
@baallan Unless you are bringing up some other new feature that hasn't been mentioned here yet (and maybe I'm not aware of?): Just to be clear, we can already do multiple metric sets per sampler. The new "multi-configuration" API permits multiple complete, independent configurations of a given plugin. There isn't, to my knowledge, anything to keep in mind specifically about schemas. It will largely be incumbent upon the user to make sure that, when they configure each configuration of the same plugin, to tell it to use either the same or different schema names, as appropriate. If you want to have the same schema in each configuration, then the schema names would be the same. If the configuration results in each configuration instance having a different actual schema, then obviously each would need to be configured with different schema names. This is really pretty similar to configuring the same sampler plugin across different nodes. The user need to take care to give the same schemas the same name, and not use the same schema name for different schemas. Generally speaking, the plugin API doesn't deal directly with schemas or metric sets (at least for samplers). There used to be a "get_set()" function (or something similar) but that was removed because it wasn't used anymore, and didn't really fit with modern sampler behavior. |
'it is probably time to create a new "base class" with support for multiple sets configurable tenant data etc... @narategithub is working on that right now' I took this to mean the sampler_base, or whatever the new thing is called, since the old one has to work through the transition period, allows the user to:
And i assert that if 3 is supported by the new api, instead of having to poke fingers into struct base_sampler, we should make sure that the same sampler can produce multiple sets of diverse schemas, not just a single schema, using the api. As an example, the ibnet sampler produces sets (10s-100s) with a port perf data schema and another set with a sample timing performance summary profile schema. But again, way off topic.... |
Oh, sorry! I missed that. |
Right, right. Back on topic: this PR looks good to me. :) |
before jobid data didn't get set if use_base was set.