Listener registration failing with RunnerScaleSetNotFoundException #3935

Open
4 tasks done
tomhaynes opened this issue Feb 19, 2025 · 10 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), needs triage (Requires review from the maintainers)

Comments

@tomhaynes

Checks

Controller Version

0.9.3,0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This has been happening this morning across our estate. Runnersets that had been happily running and registered are suddenly failing with this.

Describe the bug

Having previously been healthy, our listeners are failing to register with the GitHub API, throwing the following error:

2025/02/19 07:05:16 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "c16014ee-8847-41ae-87e9-0feb3061b89a": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.

This causes them to repeatedly retry until we have exhausted our API limits, causing all runners to cease to work.

Describe the expected behavior

Successful registration

Additional Context

-

Controller Logs

2025-02-19 12:29:28.363	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Created listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener"}
2025-02-19 12:29:28.350	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Creating listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener"}
2025-02-19 12:29:28.347	
2025-02-19T12:29:28Z	INFO	AutoscalingListener	Creating a listener pod	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}}
2025-02-19 12:29:27.545	
2025-02-19T12:29:27Z	INFO	AutoscalingListener	Listener pod is terminated	{"version": "0.9.3", "autoscalinglistener": {"name":"dev--647f5fd5-listener","namespace":"infradev"}, "namespace": "infradev", "name": "dev--647f5fd5-listener", "reason": "Error", "message": ""}

Runner Pod Logs

2025-02-19 12:23:42.746	
2025/02/19 12:23:42 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "7efdb2aa-1b5d-471a-8557-855bd8eeea18": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:30.867	
2025/02/19 12:23:30 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "c0c41d56-8847-41ae-87e9-0feb3061b89a": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:26.668	
2025/02/19 12:23:26 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "fd0647bd-4c98-4217-902b-0b7ca3343818": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:22.694	
2025/02/19 12:23:22 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "fd064c6c-4c98-4217-902b-0b7ca3343818": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:10.578	
2025/02/19 12:23:10 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "9802e18f-984c-4049-ad4b-b3d837c00514": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
2025-02-19 12:23:06.608	
2025/02/19 12:23:06 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "7efd86a8-1b5d-471a-8557-855bd8eeea18": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 6.
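
To see which scale set identifiers the listener is failing on, and how often, log lines like the ones above can be summarised with a small shell helper. This is a minimal sketch, not part of ARC: the function name `missing_scale_sets` is made up for illustration, and it assumes the error text keeps the exact wording shown in these logs.

```shell
#!/bin/sh
# Hypothetical helper (not an ARC tool): reads listener logs on stdin and
# prints one line per missing scale set in the form "<count> <identifier>".
# Assumes the message wording "No runner scale set found with identifier N."
missing_scale_sets() {
  grep 'RunnerScaleSetNotFoundException' |
    sed -n 's/.*No runner scale set found with identifier \([0-9][0-9]*\)\..*/\1/p' |
    sort -n | uniq -c | awk '{print $1, $2}'
}

# Example usage (the pod name is a placeholder for your listener pod):
#   kubectl -n "$namespace" logs <listener-pod> | missing_scale_sets
```

A single identifier with a steadily growing count suggests one scale set has gone missing on the GitHub side, rather than a general API outage.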
@tomhaynes added the bug, gha-runner-scale-set, and needs triage labels on Feb 19, 2025
@tomhaynes
Author

@nikola-jokic following on from the problems we saw a few weeks ago, could this be an issue on the GitHub API side?

@nikola-jokic
Collaborator

Hey @tomhaynes,

It seems like the scale set was removed at 2025-02-19T00:37:00Z. This might be an issue on the ARC side. Can you please describe what was happening on the cluster during that time? The controller log and the listener log should help us understand what was going on around that time.
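
When collecting logs for a window like this, it can help to cut the controller log down to the minutes around the reported removal time. The following is a minimal sketch, assuming each controller log line starts with a UTC timestamp in the `2025-02-19T12:29:28Z` form seen in the controller logs earlier in this issue. `logs_between` is a hypothetical helper name, and the window bounds in the usage comment are only examples.

```shell
#!/bin/sh
# Hypothetical helper: print log lines from stdin whose first field is a
# UTC timestamp (e.g. 2025-02-19T00:37:00Z) within [start, end] inclusive.
# Timestamps in this fixed format sort correctly as plain strings.
logs_between() {
  awk -v start="$1" -v end="$2" \
    '$1 ~ /T.*Z$/ && $1 >= start && $1 <= end'
}

# Example usage (namespace and deployment name are placeholders):
#   kubectl -n <controller-namespace> logs deploy/<controller-deployment> \
#     | logs_between 2025-02-19T00:30:00Z 2025-02-19T00:45:00Z
```

Lines that do not begin with a timestamp of this form are dropped, so multi-line stack traces would need a wider tool.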

@tomhaynes
Author

Hi, thanks for the response @nikola-jokic. I'm perhaps being stupid, but where do you see that timestamp?

This is a development cluster, and it's shut down at midnight, 2025-02-19T00:00:00Z. I did wonder if perhaps it's a race condition between the controller shutting down and it not correctly cleaning up the various CRDs that it controls?

We are also seeing this error now:

ERROR	Reconciler error	{"controller": "autoscalingrunnerset", "controllerGroup": "actions.github.com", "controllerKind": "AutoscalingRunnerSet", "AutoscalingRunnerSet": {"name":"xxx","namespace":"xxx"}, "namespace": "xxx", "name": "xxx", "reconcileID": "89d4d0d7-130c-4c6d-8a02-06d27d1cf277", "error": "failed to get actions service admin connection on refresh: github api error: StatusCode 422, RequestID \"87F4:155BD3:62627C:7A7441:67B5D5B6\": {\"message\":\"Validation Failed\",\"documentation_url\":\"https://docs.github.com/rest\",\"status\":\"422\"}"}

Uninstalling a specific gha-runner-scale-set chart, cleaning up all associated CRDs and reinstalling does appear to resolve the problem for that runner.

Are there any logs to look out for in the controller that might indicate a non-graceful shutdown?

@nikola-jokic
Collaborator

We looked into traces on the back-end side to understand what is going on. It is likely a race condition. If the controller shuts down without having enough time to clean up the environment, it can cause issues like this.
Having said that, we should invest more effort into making this kind of issue recoverable. Perhaps we should try to re-install resources when we notice errors like the one you reported.

As for the log, this is also tricky. Basically, you would have to inspect the log and notice that some steps that should have been taken are missing. That said, it would be a good idea to log as soon as the shutdown signal is received, so you can spot these issues by checking the logs below the termination mark. This solution cannot be perfect, especially when the controller is stopped without any graceful termination period, but it would help to diagnose issues with the cleanup process.

@tomhaynes
Author

We've got slightly closer possibly with one of the errors. A runnerset is throwing this error:

2025-02-19 14:11:43.776 | 2025/02/19 14:11:43 Application returned an error: createSession failed: failed to create session: actions error: StatusCode 404, AcivityId "ea91597a-3ca5-498e-a6e6-82ea9d2779b1": GitHub.Actions.Runtime.WebApi.RunnerScaleSetNotFoundException, GitHub.Actions.Runtime.WebApi: No runner scale set found with identifier 1.

And I can see that the "Runner scale set" is missing when I look at the repository in the GitHub UI. What would have removed that runner set on the GitHub side? Could we raise a feature request to have the autoscalingrunnerset recreate it when this happens?

@tomhaynes
Author

could it be related to this? actions/runner#756

@tomhaynes
Author

we've worked out a semi unpleasant way to force re-registration:

# To re-install the runners in a cluster:
export namespace=$namespace

# Remove the annotations from the autoscalingrunnersets
kubectl -n $namespace get autoscalingrunnerset --no-headers | awk '{print $1}' | xargs kubectl -n $namespace patch autoscalingrunnersets --type=merge -p='{"metadata": {"annotations": {"actions.github.com/values-hash": null,"runner-scale-set-id": null}}}'

# Remove the ephemeralrunnersets
kubectl -n $namespace get ephemeralrunnerset --no-headers | awk '{print $1}' | xargs  kubectl -n $namespace delete ephemeralrunnerset

# Remove the annotation from the autoscalinglistener
kubectl -n $namespace get autoscalinglistener --no-headers | awk '{print $1}' | xargs kubectl -n $namespace patch autoscalinglistener --type=merge -p='{"metadata": {"annotations": {"actions.github.com/runner-spec-hash": null}}}'

... which at least avoids the finalizer hell of helm uninstalls. It'd be great to understand what causes the runner sets to disappear on the GitHub repo side.
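
The patches above rely on JSON merge patch semantics, where setting a key to `null` removes it from the object. As a minimal sketch of that idea, the hypothetical helper below (`annotation_removal_patch` is not an ARC or kubectl command) builds such a patch body from a list of annotation keys. It assumes the keys contain no characters that need JSON escaping, which holds for the annotation names used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: print a JSON merge patch that removes the given
# metadata annotations by setting each one to null.
# Assumption: keys contain no quotes or backslashes (no JSON escaping done).
annotation_removal_patch() {
  body=''
  for key in "$@"; do
    # Append a comma separator only when body already has an entry.
    body="${body:+$body,}\"$key\":null"
  done
  printf '{"metadata":{"annotations":{%s}}}\n' "$body"
}

# Example usage (the resource name is a placeholder):
#   kubectl -n "$namespace" patch autoscalingrunnerset <name> --type=merge \
#     -p "$(annotation_removal_patch actions.github.com/values-hash runner-scale-set-id)"
```

Printing the patch first and reading it before applying is a cheap way to check exactly which annotations will be cleared.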

Also, is there any way to request an API to list the runner sets on the repo? I saw it was raised in #2990, and the person who raised it was directed to the community page. I tried to check whether it has been requested there, but couldn't find it.

@nikola-jokic
Collaborator

So the deletion probably occurred inside the autoscaling runner set controller. The shell script you just wrote forces the controller to think this is a new installation and re-creates resources properly, removing old resources and starting from scratch.
If you have it, can you please provide the controller log from before and after it was terminated? I would love to inspect what was going on and fix this so you don't have to use hacks to recover.

As for the API documentation, we did talk about documenting the scale set APIs, but not just yet. There are some improvements we want to make, and some of them would be considered breaking changes.

@dagi3d

dagi3d commented Feb 21, 2025

solution proposed by @tomhaynes also worked for me as I was facing the exact same issue, thanks a lot
