
do not start a second GetNextMessageAsync thread in case one is already running #4354

Open · wants to merge 6 commits into base: master
Conversation


@echalone (Contributor) commented Jul 12, 2023

Prevent the agent from starting a second GetNextMessageAsync thread if one is already running. This can happen, for example, when an automatic agent update is triggered by an incoming pipeline job and the update then fails for some reason. In that case the agent has already started another GetNextMessageAsync thread in the background, waiting for the next message from the server while the update runs. When the update fails, the original thread cycles back and starts yet another GetNextMessageAsync thread, so two threads end up waiting for messages from the server.

However, since only one thread per agent session is allowed to wait for a response from the Azure DevOps REST API, the second thread returns immediately instead of waiting the usual 50 seconds that the agent needs to work properly. So while one thread keeps doing its job and polls the server for new messages on a 50-second cycle, the other thread calls the same REST API URI, returns immediately because the server will not let it wait for messages, calls the same API again, and so on in an endless loop. This not only generates extreme network overhead (multiple calls per agent per second) but also fills up the IIS log of the Azure DevOps server, so that any disk would be full within a few days or even hours, crashing the server (as happened to us).
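The guard this PR describes can be sketched language-agnostically: before starting a new long-poll thread, check whether one is already alive and reuse it. The actual agent is written in C#; the Python below is a simplified illustration, and all names in it (MessageListener, ensure_polling, _poll) are invented for this sketch.

```python
import threading
import time

class MessageListener:
    """Sketch of guarding against a second polling thread (hypothetical names)."""

    def __init__(self):
        self._poll_thread = None

    def _poll(self):
        # Stand-in for the long-poll against the server
        # (the real agent waits ~50 seconds per request).
        time.sleep(0.5)

    def ensure_polling(self):
        # Only start a new polling thread if none is already running.
        # Without this guard, a failed agent update could leave the old
        # poller alive while the retry loop starts a second one, producing
        # the tight request loop described above.
        if self._poll_thread is not None and self._poll_thread.is_alive():
            return self._poll_thread
        self._poll_thread = threading.Thread(target=self._poll, daemon=True)
        self._poll_thread.start()
        return self._poll_thread

listener = MessageListener()
t1 = listener.ensure_polling()
t2 = listener.ensure_polling()  # reuses the running poller, no second thread
print(t1 is t2)
```

The design point is simply idempotence: calling "start polling" twice must not create two pollers.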

This happens if an agent has a higher version than the server requires, the server tries to downgrade the agent when the first build or deployment job comes in, and the agent update then fails. It does not happen if the first incoming job is the agent update request itself (say, because you clicked "Update all targets" on the server), and it also does not happen if the agent update succeeds, since the affected agent shuts down after the update and the new agent takes over with a fresh polling loop that is not affected.
Tested with Azure DevOps Server 2022 (on-premises) and agent version 2.218.1, as well as with an agent compiled from the latest master branch.

Now, one can disable agent auto-update for build agents, but there is currently no way to disable auto-update for deployment agents in the GUI, which is what caught us and crashed our server for lack of disk space (because of extremely large IIS logs). One needs to use the REST API to disable auto-update for deployment agents, but that is probably another topic.

@echalone echalone requested review from a team as code owners July 12, 2023 14:00
@echalone (Contributor, Author)

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 4354 in repo microsoft/azure-pipelines-agent

@merlynomsft (Contributor) commented Jul 13, 2023

Hi @echalone, thank you for the analysis and the detailed write-up. To accept this change, it would be helpful if you could create a knob that enables/disables the behavior. The new behavior should be off by default; that will help us validate and test it before we make it the new default. You'll be able to enable the behavior by setting the environment variable defined by the knob.

Sorry that we don't have docs on this pattern currently:
You can follow the pattern of EnableVSPreReleaseVersions (excluding the RuntimeKnobSource definition, as I think that would cause an exception in this code path):
https://github.com/microsoft/azure-pipelines-agent/blob/4978a1889290b22db564c65eb5a3b832264bcd17/src/Agent.Sdk/Knob/AgentKnobs.cs

@kirill-ivlev @tkasparek

@echalone (Contributor, Author)

Hello @merlynomsft,

I've implemented enabling the bugfix via an environment variable as you described, and the default behavior is to leave the bugfix disabled.
The environment variable to enable this bugfix is now called PREVENT_PARALLEL_POLLING and has to be set to true for the fix to take effect. Its default value is false, in which case the agent behaves exactly like older versions, as if the fix were not implemented.
When testing, be aware that agents are not downgraded from major version 3 to major version 2. So if you test the latest master branch against a server that wants to downgrade to agent version 2, you need to temporarily set the agent's major version to 2 as well for testing purposes.
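The off-by-default opt-in described above boils down to a feature flag read from the environment. A minimal sketch, in Python for illustration (the real knob lives in the agent's C# AgentKnobs; only the variable name PREVENT_PARALLEL_POLLING comes from this PR, the helper function is hypothetical):

```python
import os

def prevent_parallel_polling_enabled(env=None):
    # Off by default: only an explicit "true" enables the fix, so existing
    # installations keep the old behavior until they opt in.
    if env is None:
        env = os.environ
    return env.get("PREVENT_PARALLEL_POLLING", "false").strip().lower() == "true"

print(prevent_parallel_polling_enabled({}))                                    # default: off
print(prevent_parallel_polling_enabled({"PREVENT_PARALLEL_POLLING": "true"}))  # opted in
```

Keeping the default at false is what lets the maintainers validate the new code path without changing behavior for anyone who has not set the variable.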

In our case the hash check fails when our Azure DevOps Server 2022 tries to downgrade the agent from version 2.218.1 to 2.210.1; maybe that helps to recreate a testing scenario.
I tried adding a verbose tracing message to show when the second thread creation is skipped, but it did not log reliably (although the behavior was correct every time), probably because the log file was locked by the other thread at that moment. So I left the verbose logging message out in the end; a message that is not logged reliably would only confuse.

@merlynomsft merlynomsft added the misc Miscellaneous Changes label Oct 18, 2023