Describe the bug
We are running AL2023 on several T3 EC2 instances. They replaced instances that ran CentOS 7 (now past EOL), and we also have instances running AL2 with similar workloads; we have never seen this issue before. Two instances in particular regularly appear to freeze: their status is "running", but we can no longer connect to them over TCP and can only recover by rebooting. These instances run SonarQube and GitLab in Docker containers. We don't have much control over the internals, although for GitLab at least we have followed the instructions for running with fewer resources, thinking this might help. Our theory is that we are seeing the same issue as described in https://gist.github.com/raggi/1f8d0b9f45c5b62e7131b03e6e2ffe68. That report is for Ubuntu, but it sounds the same. We raised this with AWS Support and were told the cause was that we had used all our CPU credits and the instance had therefore been slowed down. From our viewpoint, though, the real issue is the way AL2023 handles that.
I had hoped that, especially since there is apparently a fix for this, AL2023 itself would be fixed, but there is no sign of that. We've added monitors that reboot the instances automatically when they stop responding, but this is not ideal, and jobs still fail whenever a freeze happens.
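For anyone wanting a similar stopgap, here is a minimal sketch of the kind of monitor described above. It is not our actual monitor: the names `tcp_reachable` and `FreezeDetector`, the threshold of three failed probes, and the host and port in the comment are all assumptions for illustration. The reboot itself (for example through the EC2 API) is deliberately left out.

```python
import socket
from typing import Callable


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


class FreezeDetector:
    """Flags a host as frozen after `threshold` consecutive failed probes."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe
        self.threshold = threshold
        self.failures = 0

    def check(self) -> bool:
        """Run one probe; return True when a reboot should be triggered."""
        if self.probe():
            self.failures = 0  # any success resets the failure streak
            return False
        self.failures += 1
        return self.failures >= self.threshold


# Hypothetical usage: probe SSH on a placeholder address once per interval.
# detector = FreezeDetector(lambda: tcp_reachable("192.0.2.10", 22))
# if detector.check():
#     ...  # trigger the reboot here
```

Requiring several consecutive failures avoids rebooting on a single dropped probe, at the cost of a slower reaction to a real freeze.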
To Reproduce
Steps to reproduce the behavior:
Run an app on the instance that has occasional peaks in CPU.
At some point the system will freeze.
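A rough sketch of a load generator for the steps above, in case it helps others try to reproduce this. This is an assumption about what "occasional peaks in CPU" could look like, not the workload we actually run, and we have not confirmed that it triggers the freeze. The function names and the burst and idle durations are hypothetical.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds`; return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def bursty_load(burst_s: float, idle_s: float, cycles: int) -> None:
    """Alternate CPU bursts with idle periods for `cycles` rounds."""
    for _ in range(cycles):
        burn_cpu(burst_s)
        time.sleep(idle_s)


if __name__ == "__main__":
    # Hypothetical values: 5-minute bursts, 10-minute pauses, 50 rounds.
    bursty_load(burst_s=300, idle_s=600, cycles=50)
```

This keeps only one vCPU busy per process, so on a multi-vCPU instance several copies would need to run in parallel to drain CPU credits at a meaningful rate.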
Expected behavior
The instance should never freeze like this, even when its CPU credits run out.
Screenshots
Additional context
We did the original analysis a few months ago, but the issue is still occurring. We are keeping up to date with AL2023 releases.
I am not convinced by the explanation about using up CPU resources. It does look like DHCP is timing out, which looks more like a networking problem to me. Unless, of course, the CPU is slowed down so much that systemd fails to receive the DHCP responses. That sounds far-fetched to me, but the bug you linked does seem to indicate that it is a possibility.