
refactor(nodeadm): rework containerd template mixin interface #2135

Open
ndbaker1 wants to merge 1 commit into main from ctrd-config
Conversation

ndbaker1 (Member) commented Feb 4, 2025

Issue #, if available:

Description of changes:

To recognize when the running instance has GPU devices, we currently use a hard-coded list of instance type families. To keep this list updated, or to remove it entirely, we can do one of the following:

  1. Use workflows to generate PRs for the changes. This isn't so bad, but it introduces a build-time dependency.
  2. Make API calls to ec2.DescribeInstanceTypes to determine whether the current instance has a GPU device.
  3. Read local data on the instance to determine whether an NVIDIA device is attached.

This PR implements option (3) by looking for the NVIDIA vendor ID in /proc/bus/pci/devices. The original instance-family logic is left in place, and the PCIe device check is used as a fallback; a rough sketch of that kind of check is shown below.
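
For illustration only, here is a minimal sketch of this kind of PCIe vendor scan. The helper hasNvidiaDevice and its layout are hypothetical and not the code in this PR; the NVIDIA vendor ID (0x10de) and the layout of /proc/bus/pci/devices (the second column is the 8-hex-digit vendor+device ID) are standard.

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// nvidiaVendorID is NVIDIA's PCI vendor ID as it appears (lowercase hex)
// in the vendor+device column of /proc/bus/pci/devices.
const nvidiaVendorID = "10de"

// hasNvidiaDevice reports whether any PCI device listed in
// /proc/bus/pci/devices has NVIDIA's vendor ID. Each line starts with the
// bus/devfn, followed by an 8-hex-digit field whose first four digits are
// the vendor ID.
func hasNvidiaDevice() (bool, error) {
    f, err := os.Open("/proc/bus/pci/devices")
    if err != nil {
        return false, err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        fields := strings.Fields(scanner.Text())
        if len(fields) < 2 {
            continue
        }
        if strings.HasPrefix(fields[1], nvidiaVendorID) {
            return true, nil
        }
    }
    return false, scanner.Err()
}

func main() {
    found, err := hasNvidiaDevice()
    if err != nil {
        fmt.Println("pci scan failed:", err)
        return
    }
    fmt.Println("nvidia device present:", found)
}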

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Testing Done

See this guide for recommended testing for PRs. Some tests may not apply. Completing tests and providing additional validation steps are not required, but doing so is recommended and may reduce review time and time to merge.

cartermckinnon changed the title from "refactor(nodeadm): PCIe detection for GPU instance types" to "feat(nodeadm): PCIe detection for GPU instance types" on Feb 4, 2025
cartermckinnon (Member) left a comment

Bit too much refactoring mixed in with the logical changes, IMO; I think this PR can be split

if err := writeBaseRuntimeSpec(cfg); err != nil {
    return err
}
type configurator interface {
Member

The name is a little clunky; I liked mixin in the prev impl 😉, or maybe modifier?

ndbaker1 (Member Author)

I'll try going with modifier.
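
For illustration only, a minimal sketch of what the renamed interface could look like; the names modifier, templateVars, and nodeConfig are placeholders, not this PR's actual types (the real code uses api.NodeConfig and the containerd template variables):

package containerd

// Placeholder types standing in for api.NodeConfig and the containerd
// config template variables; for illustration only.
type nodeConfig struct{}

type templateVars struct {
    RuntimeName string
}

// modifier is a hypothetical shape for the renamed interface: each
// implementation adjusts the template variables for one instance
// capability (e.g. NVIDIA GPUs).
type modifier interface {
    // Matches reports whether this modifier applies to the current instance.
    Matches(cfg *nodeConfig) bool
    // Transform applies this modifier's changes to the template variables.
    Transform(vars *templateVars)
}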

Comment on lines +23 to +25
if err := writeBaseRuntimeSpec(c); err != nil {
    return err
}
Member

What's this about? Doesn't seem related to the other changes

ndbaker1 (Member Author) commented Feb 4, 2025

The top-level daemon config calls should contain separate components, and I didn't like how the runtime spec was nested inside the config.toml call.

I think the kubelet daemon sets the right precedent:

func (k *kubelet) Configure(cfg *api.NodeConfig) error {
    if err := k.writeKubeletConfig(cfg); err != nil {
        return err
    }
    if err := k.writeKubeconfig(cfg); err != nil {
        return err
    }
    if err := k.writeImageCredentialProviderConfig(cfg); err != nil {
        return err
    }
    if err := writeClusterCaCert(cfg.Spec.Cluster.CertificateAuthority); err != nil {
        return err
    }
    if err := k.writeKubeletEnvironment(cfg); err != nil {
        return err
    }
    return nil
}

    case 1:
        configurators[0].Transform(&configVars)
    default:
        zap.L().Error("Skipping configuration because more than one configurator was matched", zap.Any("configurators", configurators))
Member

I don't think this should be an error; configurators should be composable, or there's not much benefit to the abstraction.

ndbaker1 (Member Author) commented Feb 4, 2025

I agree this looks funky; it was mainly a byproduct of trying to continue using the abstraction without a great reason. It doesn't work like a real mixin: it just populates a set of parameters, and it wouldn't make sense for them to be written more than once.

I think I'll just call each one in a loop when it matches, rather than doing this additional aggregation and check, roughly as sketched below.
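
A rough sketch of that loop-based approach, reusing the placeholder modifier interface and types from the sketch above; hypothetical, not the PR's final code:

// applyModifiers applies every matching modifier in order, instead of
// treating more than one match as an error.
func applyModifiers(cfg *nodeConfig, vars *templateVars, modifiers []modifier) {
    for _, m := range modifiers {
        if m.Matches(cfg) {
            m.Transform(vars)
        }
    }
}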

nodeadm/internal/containerd/nvidia.go (resolved)
nodeadm/internal/containerd/nvidia_test.go (outdated, resolved)
ndbaker1 force-pushed the ctrd-config branch 2 times, most recently from cf5d952 to a537482, on February 5, 2025 at 06:37
ndbaker1 (Member Author) commented Feb 8, 2025

#2146 replaces this PR for the PCIe detection functionality; this PR will instead become a refactor of the mixin interface.

ndbaker1 changed the title from "feat(nodeadm): PCIe detection for GPU instance types" to "refactor(nodeadm): rework containerd template mixin interface" on Feb 8, 2025
ndbaker1 removed the request for review from cartermckinnon on February 8, 2025 at 00:55