Guidance on Generative AI usage in Kubernetes GitHub Orgs #291
> for those of you already part of the …
Unfortunately not; I can't see this. But I don't think we should be driven by non-public, non-kubernetes-org discussions anyhow. There was a recent thread in our Slack, however: https://kubernetes.slack.com/archives/C1TU9EB9S/p1744211437481679
My understanding was that the CNCF communicated the LF's guidance, and that this was the main outcome of the initial flurry of discussions around this space. At least, that's what I recall from discussing previously via our GB rep, cc @cblecker.
For the copyright angle, we fundamentally have to ensure that contributions are CLA-ed and of good quality, even if code is copy-pasted from Stack Overflow instead of produced via genAI. I think https://www.linuxfoundation.org/legal/generative-ai covers the expectations around the CLA.

Regarding the quality bar, spamminess, etc.: generated, copy-pasted, and human-authored content all need to be held to our standards. If there are gaps in our standards, we should address those directly. We cannot reliably know how content was authored (if you could accurately detect AI-generated content, you would have created the perfect oracle for training the next model), but we can expect that it be reasonable, non-spammy, high quality, respectful, and accurately attributed/copyrighted.

I think the most productive approach is to focus on a general content and behavior policy. Looking for example at: https://devguide.python.org/getting-started/generative-ai/
This example seems largely redundant with having a more general reasonable-contribution / moderation policy. We already have a standing practice that spam / poor-quality PRs will be closed and that repeated spam will result in blocking by our GitHub moderation team. I think Kubernetes might need to document those better, though, as I'm having difficulty turning up a document that spells this out for GitHub as opposed to Slack/discuss/... @kubernetes/owners is there a doc I'm missing that actually spells out how we moderate GitHub? cc @kubernetes/steering-committee
Per the thread, which had some tangents: LLMs are just a special case of "repeated bad-faith/inadequate submissions". It's not qualitatively different from Hacktoberfest or Stack Overflow copy-and-paste, or any of the "I removed a single comma" PRs that SIG Docs gets. LLMs may be quantitatively different, but that calls for automation rather than policy. There's also the fact that current automated LLM-detection tools have disappointingly low accuracy.

Also, note that many younger new contributors will not be aware that there's anything wrong with having an LLM generate a PR. They need to be educated. And there are non-problematic uses of LLMs (like grammar checking or auditing). My suggestion is that we do two things:
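As a purely illustrative aside on the "automation rather than policy" point above: expectation-setting can be automated uniformly for all new contributors, which sidesteps the false-positive problem of trying to detect how a change was authored. Here is a minimal sketch in Python with PyGithub (the token handling, repository name, and canned wording are all assumptions, and real Kubernetes automation runs through Prow rather than one-off scripts):

```python
# Hypothetical sketch: greet a first-time contributor's PR with the standing
# contribution guidance, rather than litigating how the content was authored.
# Assumes a bot token with permission to comment; all names are placeholders.
from github import Github

GUIDANCE = (
    "Thanks for the pull request! Please read "
    "https://www.kubernetes.dev/docs/guide/pull-requests/ (including the "
    "'trivial edits' section). Changes that you cannot explain or revise "
    "yourself, however they were produced, are unlikely to be merged."
)


def greet_if_first_pr(token: str, repo_name: str, pr_number: int) -> None:
    gh = Github(token)
    pr = gh.get_repo(repo_name).get_pull(pr_number)
    # The search API counts the author's PRs in this repo, including this one.
    author_prs = gh.search_issues(
        f"repo:{repo_name} is:pr author:{pr.user.login}"
    ).totalCount
    if author_prs <= 1:  # this PR is the author's first in the repo
        pr.create_issue_comment(GUIDANCE)
```

The point is only that guidance can be delivered the same way to everyone, regardless of how their change was produced.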
The LF guidance seems entirely focused on the legal issues, but what about the spamminess issues? I just reviewed a PR that looked good at a high level but has some issues once you look more closely, and I'm pretty sure that the submitter has no understanding of what those issues are and would not be able to address them if I pointed them out (either by rewriting the code themselves or by instructing their AI tool in a way that would result in the correct changes). Meaning that they have just wasted my time, because this PR will never turn into a valid contribution. I think we should say that (for code PRs at least) we do not accept AI-generated PRs from non-org-members. (And make that part of the CLA.)
This is discussed above. Spamminess issues aren't new to the org with genAI; we've received all sorts of generated spam before this, and excessive spammy behavior (especially obviously automated) results in a ban from our org moderators. I don't know if we actually have this in writing for GitHub yet, but it needs to apply to any automated spamminess. Nudge @kubernetes/owners again -- do we have this written down somewhere?
To be clear: the LF operates the CLA system, which is focused on licensing and compliance. I don't think we should cram this in there.
Spammy "change one word" PRs are annoying but take 10 seconds of your time. I spent an hour reviewing a PR from this same person yesterday, who I had assumed was an eager new contributor who just needed some help getting familiar with our codebase and conventions. But now I assume that they'll just copy my review into their AI tool, have it regenerate the patch, and then push an update with a different set of problems that they don't understand. IMO, a single AI-generated code PR should be considered already over the threshold for "excessive spammy behavior".
Ah, well, we can at least get them to update it with their own stance on AI-generated code and licensing/compliance if they haven't already...
@BenTheElder, there's some wording around (umbrella) spam moderation in the moderation guidelines document here, but nothing that specifies the kind(s) of spam – https://github.com/kubernetes/community/blob/master/communication/moderation.md?plain=1#L134-L135
I wasn't actually talking about those in particular; we've been getting spam with everything from advertising to bizarre, inexplicable nonsense and poorly generated noise for as long as Kubernetes has been popular. We moderate those. I think the main difference is that moderating those quickly was easier because they didn't remotely look like authentic good-faith behavior. So we should have a policy for that, but we seem to be enacting a mostly unwritten one. That needs fixing regardless.

"Repeatedly push a hacky patch they don't understand" is common enough with new drive-by contributors even without genAI, and isn't inherently a moderation-worthy "offense". If we make it an "offense" only when AI is "detected", see above: that is fraught with false positives and assumes poorly of contributors.

I don't think we can do that; we also don't ban people for a single instance of mass trivial typo-fix PRs. We just correct the expectations by asking them to help us automatically correct the typos and/or consolidate their PRs. We don't reward the behavior, but we also don't escalate until after we've raised it with them, because it's entirely possible they were actually trying to be helpful.
Right, and that doc also doesn't clearly scope GitHub versus Slack, etc. I think we clearly need written guidance; the rest of the GitHub management program has reasonably detailed docs with clear policies. Let's split that off as a starting point to capture our current approach more clearly. Then we can identify what, if any, changes are necessary.
For PRs we also have https://www.kubernetes.dev/docs/guide/pull-requests/#trivial-edits, and we recently extended it with an explanation about fixing linter issues, as that was another source of unwanted PRs. I'm not sure if or how "don't submit AI-generated code that you don't understand" fits there, but perhaps somewhere above it?
Assuming that this discussion is not specific to code: we have wording specific to SIG Docs localisation contributions here, whereby we caution contributors not to rely on machine-generated translations alone, since they often don't meet our quality standards. There's also an adjacent, ongoing discussion between SIG ContribEx and SIG Docs about setting guardrails around submitting AI-generated content as blogs or docs, given that we've seen a spike in those kinds of submissions since 2023 and the guidelines in our contribution docs haven't been sufficient to address them.
OK, maybe what I want is for us to try to make it clearer that submitting AI-generated PRs is often not going to be helpful.
Yeah, I think we need to make this clear somewhere. Young contributors these days have been subject to a lot of advertising that AI will give them a jump ahead in learning/knowledge/work, some of it from folks they should be able to trust. I think we need to document for them not only that it's problematic, but WHY -- especially since the reasons such contributions are problematic aren't limited to genAI ones. This should probably live in each of the contributor guides (and the contributor tutorial for Kubernetes).
Split out kubernetes/community#8439 regarding documenting GitHub moderation policies, so we can then iterate on that aspect.
Probably best in one of the guides linked from this comment which we post to PRs from new contributors. It looks like we currently link to the non-rendered form at https://git.k8s.io/community/contributors/guide/pull-requests.md; we should update that to https://www.kubernetes.dev/docs/guide/pull-requests. I'll send that cleanup now ...
This topic was discussed in the public steering meeting on 5/07 (recording), and the steering committee agreed that we don't want to explicitly provide any guidance targeting Generative AI usage. We will defer to SIGs to provide such guidance where appropriate. However, we will provide wording (@pohly volunteered to write the initial draft) similar to https://www.kubernetes.dev/docs/guide/pull-requests/#trivial-edits, which reviewers and approvers will be able to use when closing PRs.
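Purely as a hypothetical illustration (this is not @pohly's actual draft), a canned reply modeled on the trivial-edits wording might look something like:

> Thank you for the contribution. However, this PR does not meet the bar described in https://www.kubernetes.dev/docs/guide/pull-requests/: it appears to contain generated or copy-pasted changes that the author cannot explain, review, or revise, which consumes reviewer time without moving the project forward. We are closing this PR; please read the contributor guide before opening another one.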
This primarily came out of the discussion around allowing the use of LLMs (kubernetes/steering#291), but isn't limited to it because other tools (search/replace, linters) can have the same effect. The goal is to clarify expected behavior and to give reviewers something that they can link to when they decide that a PR shouldn't get merged.
LF has guidance here:
https://www.linuxfoundation.org/legal/generative-ai
(CNCF does not have one yet)
Ladybird has some language:
https://github.com/LadybirdBrowser/ladybird/blob/master/CONTRIBUTING.md#on-usage-of-ai-and-llms
Python has some language:
https://devguide.python.org/getting-started/generative-ai/
What is our (wearing my k8s hat!) position here? What can we write down that we can point people to, as we will inevitably see more and more of this?