[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

john-forrest · 2024-08-06T09:01:09Z

Describe the bug
We are running AL2023 on several T3 ec2 instances - they have replaced some instances that ran centos 7 (now passed EOL) but we also have instances that run AL2 with similar items - we have never seen this issue before. In particular we have two instances that regularly appear to freeze up - their status is "running" but we cease to be able to connect to them over TCP and can only recover by rebooting. These instances run sonarqube and gitlab on docker containers - we don't get too much control over the internals, although have followed instructions for gitlab at least to run with less resources as much as possible, thinking this might help. Theory is that we are seeing the same issue as described in https://gist.github.com/raggi/1f8d0b9f45c5b62e7131b03e6e2ffe68 although that is ubuntu, it definitely sounds the same. We raised this AWS Support and were told the basis was that we'd used all our CPU Credits and thus the instance had been slowed down. From our viewpoint though the real issue is the way AL2023 is handling that.

I have hoped that, esp. since there is apparently a fix for this, AL2023 would itself be fixed, but no sign. We've added monitors that will reboot the instances automatically should they stop responding but it is not ideal and we have job failures etc for when the freeze happens.

To Reproduce
Steps to reproduce the behavior:

Run an app on the instance that has occasional peaks in CPU.
At some point the system will freeze.

Expected behavior
We should never hit this scenario

Screenshots

Additional context

We did the original analysis a few months ago but issue still occurring. We are keeping up to date on AL2023 releases.

ozbenh · 2024-08-09T03:33:28Z

I am not convinced by the explanation about using up CPU resources... it does look like DHCP is timing out which looks more like a networking problem to me... unless of course the CPU is slowed down so much that systemd fails to receive the DHCP responses but that sounds far fetched to me but the bug you linked does seem to indicate it as being a possibility...

We will attempt to get to the bottom of it

john-forrest · 2024-08-30T10:43:25Z

@ozbenh Any update on this?

ozbenh added the bug Something isn't working label Aug 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

john-forrest commented Aug 6, 2024

ozbenh commented Aug 9, 2024

john-forrest commented Aug 30, 2024

[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

[Bug] - Occasional freezing of AL2023 servers (dhcp timeout issue with slowed processor) #773

Comments

john-forrest commented Aug 6, 2024

ozbenh commented Aug 9, 2024

john-forrest commented Aug 30, 2024