Guidance on Generative AI usage in Kubernetes Github Orgs #291


Open
dims opened this issue Apr 16, 2025 · 15 comments · May be fixed by kubernetes/community#8451
Labels
committee/steering Denotes an issue or PR intended to be handled by the steering committee.

Comments

@dims
Member

dims commented Apr 16, 2025

LF has guidance here:
https://www.linuxfoundation.org/legal/generative-ai

(CNCF does not have one yet)

Ladybird has some language:
https://github.com/LadybirdBrowser/ladybird/blob/master/CONTRIBUTING.md#on-usage-of-ai-and-llms

Python has some language:
https://devguide.python.org/getting-started/generative-ai/

What is our (wearing my k8s hat!) position here? What can we write down that we can point people to, as inevitably we will see more and more of this?

@dims
Member Author

dims commented Apr 16, 2025

for those of you already part of the community/maintainers repo, this issue was triggered by discussion:
https://github.com/community/maintainers/discussions/470

@BenTheElder
Member

> for those of you already part of the community/maintainers repo, this issue was triggered by discussion:
> https://github.com/community/maintainers/discussions/470

Unfortunately not, I can't see this. But I don't think we should be driven by non-public non-kubernetes-org discussions anyhow.

There was a recent thread in our slack however: https://kubernetes.slack.com/archives/C1TU9EB9S/p1744211437481679

> (CNCF does not have one yet)

My understanding was that the CNCF communicated the LF's guidance, and that this was the main outcome of the initial flurry of discussions around this space.

At least, that's what I recall from discussing previously via our GB rep, cc @cblecker.

> What is our (wearing my k8s hat!) position here? what can we write down that we can point people to as inevitably we will see more and more of it?

For the copyright angle we fundamentally have to ensure that contributions are CLA-ed and good quality even if code is copy-pasted from stack overflow instead of via genAI.

I think https://www.linuxfoundation.org/legal/generative-ai covers the expectations around the CLA.
We could link this in more places, but ... where exactly? (thinking ... suggestions welcome ...)

Regarding the quality bar, spamminess, etc.: generated, copy-pasted, or human-authored content all needs to be held to our standards.
Even before genAI became popular, Kubernetes received a lot of strange, seemingly auto-generated spam and has consistently moderated it and banned users that repeatedly engage in this.

If there are gaps in our standards, we should address those directly.

We cannot reliably know how content was authored (if you could accurately detect AI-generated content, you would have created the perfect oracle for training the next model ...), but we can expect that it be reasonable, non-spammy, high quality, respectful, and accurately attributed / copyrighted.

I think the most productive approach is to focus on general content and behavior policy.


Looking for example at: https://devguide.python.org/getting-started/generative-ai/

> Acceptable uses
>
> Some of the acceptable uses of generative AI include:
>
> - Assistance with writing comments, especially in a non-native language
> - Gaining understanding of existing code
> - Supplementing contributor knowledge for code, tests, and documentation
>
> Unacceptable uses
>
> Maintainers may close issues and PRs that are not useful or productive, including those that are fully generated by AI. If a contributor repeatedly opens unproductive issues or PRs, they may be blocked.

This example seems largely redundant with having a more general reasonable-contribution / moderation policy.

We already have standing enforcement that spam / poor-quality PRs will be closed and that repeated spam will result in blocking by our GitHub moderation team.

I think Kubernetes might need to better document those, though, as I'm having difficulty turning up a document that spells this out for GitHub as opposed to Slack/Discuss/...

@kubernetes/owners is there a doc I'm missing that actually spells out how we moderate GitHub?

cc @kubernetes/steering-committee

@BenTheElder BenTheElder added the committee/steering Denotes an issue or PR intended to be handled by the steering committee. label Apr 16, 2025
@jberkus
Contributor

jberkus commented Apr 16, 2025

Per the thread, which had some tangents: LLMs are just a special case of "repeated bad-faith/inadequate submissions". It's not qualitatively different from Hacktoberfest or SO copy-and-paste, or any of the "I removed a single comma" PRs that SIG Docs gets. LLMs may be quantitatively different, but that calls for automation rather than policy.

There's also the fact that current automated LLM-detection tools have disappointingly low accuracy.

Also, note that many younger new contributors will not be aware that there's anything wrong with having an LLM generate a PR. They need to be educated. And there are non-problematic uses of LLMs (like grammar checking or auditing).

My suggestion is that we do two things:

  1. Review contributor docs to make sure that we clearly call out bad-faith/inadequate contributions (all types, not just LLM), explain what they are and why they are not permitted (and will be rejected and banned if repeated).

  2. Keep an eye on LLM-generated submissions, and figure out when we need to develop automation to prevent volume from being a burden on triagers/reviewers. Blocking the specific ones we're getting is going to be more solvable than blocking all GenAI in general.

@danwinship

The LF guidance seems entirely focused on the legal issues, but what about the spamminess issues?

I just reviewed a PR that looked good at a high level but has some issues once you look more closely, and I'm pretty sure the submitter has no understanding of what those issues are and would not be able to address them if I pointed them out (either by rewriting the code themselves or by instructing their AI tool in a way that would result in the correct changes). Meaning they have just wasted my time, because this PR will never turn into a valid contribution.

I think we should say that (for code PRs at least) we do not accept AI-generated PRs from non-org-members. (And make that part of the CLA.)

@BenTheElder
Member

> issues, but what about the spamminess issues?

This is discussed above. Spamminess issues aren't new to the org with gen-AI; we've received all sorts of generated spam behavior before this, and excessive spammy behavior (especially obviously automated) results in a ban from our org moderators.

I don't know if we actually have this in writing for GitHub yet, but it needs to apply to any automated spamminess.

nudge @kubernetes/owners again -- do we have this written down somewhere?

> And make that part of the CLA

To be clear: the LF operates the CLA system, which is focused on licensing and compliance. I don't think we cram this in there.

@danwinship

Spammy "change one word" PRs are annoying but take 10 seconds of your time. I spent an hour reviewing a PR from this same person yesterday, who I had assumed was an eager new contributor who just needed some help getting familiar with our codebase and conventions. But now I assume that they'll just copy my review into their AI tool, have it regenerate the patch, and then push an update with a different set of problems that they don't understand.

IMO, a single AI-generated code PR should be considered already over the threshold for "excessive spammy behavior".

> To be clear: the LF operates the CLA system, which is focused on licensing and compliance. I don't think we cram this in there.

Ah, well, we can at least get them to update it with their own stance on AI-generated code and licensing/compliance if they haven't already...

@Priyankasaggu11929
Member

>> issues, but what about the spamminess issues?
>
> This is discussed above. Spamminess issues aren't new to the org with gen-ai, we've received all sorts of generated spam behavior before this and excessive spammy behavior (especially obviously automated) results in a ban from our org moderators.
>
> I don't know if we actually have this in writing for GitHub yet, but it needs to apply to any automated spamminess.
>
> nudge @kubernetes/owners again -- do we have this written down somewhere?

@BenTheElder, there's some wording around (umbrella) spam moderation in the moderation guidelines document here, but nothing that specifies the kind(s) of spam – https://github.com/kubernetes/community/blob/master/communication/moderation.md?plain=1#L134-L135

@BenTheElder
Member

> Spammy "change one word" PRs are annoying but take 10 seconds of your time.

I wasn't actually talking about those in particular; we've been getting spam with everything from advertising to bizarre, inexplicable nonsense and poorly generated noise for as long as Kubernetes has been popular. We moderate those. I think the main difference is that moderating those quickly was easier because they didn't remotely look like authentic good-faith behavior.

So we should have a policy for that, but we seem to be enacting a mostly unwritten one. That needs fixing regardless.

> I spent an hour reviewing a PR from this same person yesterday, who I had assumed was an eager new contributor who just needed some help getting familiar with our codebase and conventions. But now I assume that they'll just copy my review into their AI tool, have it regenerate the patch, and then push an update with a different set of problems that they don't understand.

"Repeatedly push a hacky patch they don't understand" is common enough with new drive-by contributors without genAI and isn't inherently a moderation worthy "offense".

If we make it an "offense" only when AI is "detected", see above. This is fraught with false positives and assumes poorly of contributors.

> IMO, a single AI-generated code PR should be considered already over the threshold for "excessive spammy behavior".

I don't think we can do that; we also don't ban people for a single instance of mass trivial typo-fix PRs.

We just correct the expectations by asking them to help us automatically correct the typos and/or consolidate their PRs. We don't reward the behavior but we also don't escalate until after we've raised it with them, because it's entirely possible they were actually trying to be helpful.

> @BenTheElder, there's some wording around (umbrella) spam moderation in the moderation guidelines document here, but nothing that specifies the kind(s) of spam

Right, and that doc also doesn't clearly scope GitHub versus Slack etc. I think we clearly need written guidance; the rest of the GitHub management program has reasonably detailed docs with clear policies. Let's split that off as a starting point to capture our current approach more clearly.

Then we can identify what, if any, changes are necessary.

@pohly
Contributor

pohly commented Apr 22, 2025

For PRs we also have https://www.kubernetes.dev/docs/guide/pull-requests/#trivial-edits and recently extended it with an explanation about fixing linter issues as that was another source of unwanted PRs.

I'm not sure if or how "don't submit AI-generated code that you don't understand" fits there, but perhaps somewhere above it?

@divya-mohan0209

divya-mohan0209 commented Apr 22, 2025

Assuming that this discussion is not specific to code, we have wording specific to SIG Docs localisation contributions here, whereby we caution contributors not to rely on machine-generated translations alone, since they often don't meet our quality standards. There's also an adjacent discussion ongoing between SIG ContribEx and SIG Docs about setting guardrails around submitting AI-generated content as blogs or docs, given how we've seen a spike in those kinds of submissions since 2023 and the guidelines in our contribution docs haven't been sufficient to address those occurrences.

@danwinship

> it's entirely possible they were actually trying to be helpful

OK, maybe what I want is for us to try to make it clearer that submitting AI-generated PRs is often not going to be helpful.

@jberkus
Contributor

jberkus commented Apr 28, 2025

> OK, maybe what I want is for us to try to make it clearer that submitting AI-generated PRs is often not going to be helpful.

Yeah, I think we need to make this clear somewhere. Young contributors these days have been subject to a lot of advertising that AI will give them a jump ahead in learning/knowledge/work, some of it from folks they should be able to trust. I think we need to document for them not only that it's problematic, but WHY -- especially since the reasons such contributions are problematic aren't limited to genAI ones. This should probably live in each of the contributor guides (and the contributor tutorial for Kubernetes).

@BenTheElder
Member

BenTheElder commented Apr 28, 2025

Split out kubernetes/community#8439 regarding documenting GitHub moderation policies so we can then iterate on that aspect.

> For PRs we also have https://www.kubernetes.dev/docs/guide/pull-requests/#trivial-edits and recently extended it with an explanation about fixing linter issues as that was another source of unwanted PRs.
>
> I'm not sure if or how "don't submit AI-generated code that you don't understand" fits there, but perhaps somewhere above it?

Probably best in one of the guides linked from this comment which we post to PRs from new contributors:

https://github.com/kubernetes/test-infra/blob/2609879bc1bafee98af45e43d1927841a49eb87c/config/prow/plugins.yaml#L699

It looks like currently we link to the non-rendered form at https://git.k8s.io/community/contributors/guide/pull-requests.md, we should update that to https://www.kubernetes.dev/docs/guide/pull-requests, I'll send that cleanup now ...
EDIT: kubernetes/test-infra#34746

@soltysh
Contributor

soltysh commented May 7, 2025

This topic was discussed in the public steering meeting on May 7 (recording), and the steering committee agreed that we don't want to explicitly provide any guidance targeting generative AI usage. We will defer to SIGs to provide such guidance where appropriate. However, we will provide wording (@pohly volunteered to take the initial take) similar to https://www.kubernetes.dev/docs/guide/pull-requests/#trivial-edits which reviewers and approvers will be able to use when closing PRs.

pohly added a commit to pohly/community that referenced this issue May 8, 2025
This primarily came out of the discussion around allowing the use of
LLMs (kubernetes/steering#291), but isn't limited to
it because other tools (search/replace, linters) can have the same effect.

The goal is to clarify expected behavior and to give reviewers something that
they can link to when they decide that a PR shouldn't get merged.
pohly added a commit to pohly/community that referenced this issue May 8, 2025
@pohly
Contributor

pohly commented May 8, 2025

> @pohly volunteered to take the initial take

See kubernetes/community#8451

pohly added a commit to pohly/community that referenced this issue May 9, 2025
pohly added a commit to pohly/community that referenced this issue May 14, 2025
pohly added a commit to pohly/community that referenced this issue May 15, 2025