
do not start a second GetNextMessageAsync thread in case one is already running #4354

Open · wants to merge 6 commits into base: master
Conversation


@echalone (Contributor) commented Jul 12, 2023

Prevent the agent from starting a second GetNextMessageAsync thread if one is already running. This can happen, for example, when an automatic agent update is triggered by an incoming pipeline job and the update then fails for some reason. In that case the agent has already started another GetNextMessageAsync thread in the background, waiting for the next message from the server while the update runs. When the update fails, the original thread cycles back and starts yet another GetNextMessageAsync thread, so two threads end up waiting for messages from the server.

However, since only one thread per agent session is allowed to wait for a response from the Azure DevOps REST API, the second thread returns immediately instead of waiting the usual 50 seconds that the agent needs to work properly. So while one thread keeps doing its job and polls the server for new messages on a 50-second cycle, the other thread calls the same REST API URI, returns immediately because the server will not let it wait for messages, calls the same API again, and so on in an endless loop. This not only generates extreme network overhead (multiple calls per agent per second) but also fills up the IIS log of the Azure DevOps server, so that any disk would be full within a few days or even hours, crashing the server (as happened to us).
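The guard this PR describes can be sketched language-agnostically: before starting a new long-poll thread, check whether one is already alive and reuse it. The actual agent is written in C#; the Python below is a simplified illustration, and all names in it (MessageListener, ensure_polling, _poll) are invented for this sketch.

```python
import threading
import time

class MessageListener:
    """Sketch of guarding against a second polling thread (hypothetical names)."""

    def __init__(self):
        self._poll_thread = None

    def _poll(self):
        # Stand-in for the long-poll against the server
        # (the real agent waits ~50 seconds per request).
        time.sleep(0.5)

    def ensure_polling(self):
        # Only start a new polling thread if none is already running.
        # Without this guard, a failed agent update could leave the old
        # poller alive while the retry loop starts a second one, producing
        # the tight request loop described above.
        if self._poll_thread is not None and self._poll_thread.is_alive():
            return self._poll_thread
        self._poll_thread = threading.Thread(target=self._poll, daemon=True)
        self._poll_thread.start()
        return self._poll_thread

listener = MessageListener()
t1 = listener.ensure_polling()
t2 = listener.ensure_polling()  # reuses the running poller, no second thread
print(t1 is t2)
```

The design point is simply idempotence: calling "start polling" twice must not create two pollers.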

This happens if an agent has a higher version than the server requires, the server tries to downgrade the agent when the first build or deployment job comes in, and the agent update then fails. It does not happen if the first incoming job is the agent update request itself (say, because you clicked "Update all targets" on the server), and it also does not happen if the agent update succeeds, since the affected agent shuts down after the update and the new agent takes over with a fresh polling loop that is not affected.
Tested with Azure DevOps Server 2022 (on-premises) and agent version 2.218.1, as well as with an agent compiled from the latest master branch.

Now, one can disable agent auto-update for build agents, but there is currently no way to disable auto-update for deployment agents in the GUI, which is what caught us and crashed our server for lack of disk space (because of extremely large IIS logs). One needs to use the REST API to disable auto-update for deployment agents, but that is probably another topic.

@echalone echalone requested review from a team as code owners July 12, 2023 14:00
@echalone (Contributor, Author)

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 4354 in repo microsoft/azure-pipelines-agent

@merlynomsft (Contributor) commented Jul 13, 2023

Hi @echalone, thank you for the analysis and the detailed write-up. To accept this change, it would be helpful if you could create a knob that enables/disables the behavior. The new behavior should be off by default; that will help us validate and test it before we make it the new default. You'll be able to enable the behavior by setting the environment variable defined by the knob.

Sorry that we don't have docs on this pattern currently:
You can follow the pattern of EnableVSPreReleaseVersions (excluding the RuntimeKnobSource definition, as I think that would cause an exception in this code path):
https://github.com/microsoft/azure-pipelines-agent/blob/4978a1889290b22db564c65eb5a3b832264bcd17/src/Agent.Sdk/Knob/AgentKnobs.cs

@kirill-ivlev @tkasparek

@echalone (Contributor, Author)

Hello @merlynomsft,

I've implemented enabling the bugfix via an environment variable as you described, and the default behavior is to leave the bugfix disabled.
The environment variable to enable this bugfix is now called PREVENT_PARALLEL_POLLING and has to be set to true for the fix to take effect. Its default value is false, in which case the agent behaves exactly like older versions, as if the fix were not implemented.
When testing, be aware that agents are not downgraded from major version 3 to major version 2. So if you test the latest master branch against a server that wants to downgrade to agent version 2, you need to temporarily set the agent's major version to 2 as well for testing purposes.
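The off-by-default opt-in described above boils down to a feature flag read from the environment. A minimal sketch, in Python for illustration (the real knob lives in the agent's C# AgentKnobs; only the variable name PREVENT_PARALLEL_POLLING comes from this PR, the helper function is hypothetical):

```python
import os

def prevent_parallel_polling_enabled(env=None):
    # Off by default: only an explicit "true" enables the fix, so existing
    # installations keep the old behavior until they opt in.
    if env is None:
        env = os.environ
    return env.get("PREVENT_PARALLEL_POLLING", "false").strip().lower() == "true"

print(prevent_parallel_polling_enabled({}))                                    # default: off
print(prevent_parallel_polling_enabled({"PREVENT_PARALLEL_POLLING": "true"}))  # opted in
```

Keeping the default at false is what lets the maintainers validate the new code path without changing behavior for anyone who has not set the variable.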

In our case the hash check fails when our Azure DevOps Server 2022 tries to downgrade the agent from version 2.218.1 to 2.210.1; maybe that helps to recreate a testing scenario.
I tried adding a verbose tracing message to show when the second thread creation is skipped, but it did not log reliably (although the behavior was correct every time), probably because the log file was locked by the other thread at that moment. So I left the verbose logging message out in the end; a message that is not logged reliably would only confuse.

@merlynomsft merlynomsft added the misc Miscellaneous Changes label Oct 18, 2023