Failures while assigning public IP via enableStaticNat #10313

Open
akrasnov-drv opened this issue Feb 2, 2025 · 13 comments
@akrasnov-drv

problem

enableStaticNat starts failing after just a few uses.
Actually I see the IP is assigned, but the call still fails with a timeout after only 3-4 uses.
For my test I created a number of VMs and then tried to assign a public IP to each of them sequentially:

for i in `cloudstack listVirtualMachines | jq ".virtualmachine[].id"`
> do
> ip=`cloudstack associateIpAddress networkId=_my-network-id_ | jq ".ipaddress.id"`
> time cloudstack enableStaticNat ipaddressid=${ip} virtualmachineid=${i}
> done
{
  "success": true
}

real	0m9.272s
user	0m0.259s
sys	0m0.052s
{
  "success": true
}

real	0m10.097s
user	0m0.274s
sys	0m0.033s
{
  "success": true
}

real	0m10.301s
user	0m0.260s
sys	0m0.040s
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 446, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 441, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 534, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 448, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/cloudstack", line 11, in <module>
    load_entry_point('cs==2.7.1', 'console_scripts', 'cs')()
  File "/usr/lib/python3/dist-packages/cs/__init__.py", line 104, in main
    response = getattr(cs, command)(fetch_result=fetch_result,
  File "/usr/lib/python3/dist-packages/cs/client.py", line 213, in handler
    return self._request(command, **kwargs)
  File "/usr/lib/python3/dist-packages/cs/client.py", line 273, in _request
    response = session.send(prepped,
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 657, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

real	0m10.357s
user	0m0.281s
sys	0m0.068s
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 446, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 441, in _make_request
    httplib_response = conn.getresponse()
...

I tried different network and VR configurations, and it always ends up failing, at best after 6-7 successful assignments.
A large VR with 4 CPUs and several GB of memory did not help either.

The time output for all failing calls shows real 0m10.350s or slightly more.

Started from #10184

versions

CloudStack 4.20.0.0 with https://github.com/apache/cloudstack/pull/10254/files applied (PR does not help with this)
Ubuntu 22.04.5 LTS
libvirt 8.0.0-1ubuntu7.10
isolated network over VLAN
about 1000 public IPs in /20

The steps to reproduce the bug

  1. Create a number of VMs (I used 100, but it looks like just a few should be enough).
  2. Use associateIpAddress to get a public IP ID.
  3. Use enableStaticNat with the VM ID and IP ID.
  4. Repeat steps 2-3 for the VMs from step 1 until it starts failing (after about 3-4 cycles in my case).
     Initially enableStaticNat takes 9 seconds, then increases to 10, and then just starts failing with a timeout.

Here is a loop doing the above:

for i in `cloudstack listVirtualMachines | jq ".virtualmachine[].id"`; do
    ip=`cloudstack associateIpAddress networkId=${YOUR_NETWORK_ID_HERE} | jq ".ipaddress.id"`
    time cloudstack enableStaticNat ipaddressid=${ip} virtualmachineid=${i}
done
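
For reference, a variant of the same loop that timestamps each call and pauses briefly between iterations makes the per-call growth easier to spot (MY_NETWORK_ID is a placeholder for the isolated network's UUID):

# Same reproduction loop, with timestamps and a short pause between calls.
# MY_NETWORK_ID is a placeholder; set it to your isolated network's UUID.
for i in `cloudstack listVirtualMachines | jq -r ".virtualmachine[].id"`; do
    date
    ip=`cloudstack associateIpAddress networkId=${MY_NETWORK_ID} | jq -r ".ipaddress.id"`
    time cloudstack enableStaticNat ipaddressid=${ip} virtualmachineid=${i}
    sleep 2
done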

What to do about it?

It looks like the call takes too long to return and should be optimized.
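
As a client-side workaround only (it does not address the server-side slowness), the read timeout of the cs CLI can usually be raised so the real duration of each call can at least be measured; this is a minimal sketch, assuming the installed cs version honors a timeout option in its ini section:

# ~/.cloudstack.ini -- whichever section your cloudstack command reads
# (assumption: this cs version supports the 'timeout' option; the default is 10 seconds)
[cloudstack]
endpoint = http://localhost:8080/client/api
key = <api key>
secret = <secret key>
timeout = 60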

@akrasnov-drv akrasnov-drv changed the title Failures while assiging public IP via enableStaticNat Failures while assigning public IP via enableStaticNat Feb 2, 2025
@DaanHoogland DaanHoogland added this to the 4.20.1 milestone Feb 3, 2025
@akrasnov-drv
Author

akrasnov-drv commented Feb 3, 2025

Just a small update.
After a day I tried to assign another public IP and got the same error, so the problem is not caused by some temporary overload; it persists.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 446, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 441, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 534, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 448, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/cloudstack", line 11, in <module>
    load_entry_point('cs==2.7.1', 'console_scripts', 'cs')()
  File "/usr/lib/python3/dist-packages/cs/__init__.py", line 104, in main
    response = getattr(cs, command)(fetch_result=fetch_result,
  File "/usr/lib/python3/dist-packages/cs/client.py", line 213, in handler
    return self._request(command, **kwargs)
  File "/usr/lib/python3/dist-packages/cs/client.py", line 273, in _request
    response = session.send(prepped,
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 657, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

real	0m10.345s
user	0m0.308s
sys	0m0.028s

Another possibly interesting detail: when I tried to add a public IP to a VM that already had one, I immediately got the corresponding error (431 - is already associated with another public ip).
I mean the API response was very fast:

real	0m0.338s
user	0m0.248s
sys	0m0.057s

@akrasnov-drv
Author

Releasing IPs was also very slow, though I did it from the UI and it did not fail. IPs disappeared from the allocated list quickly enough, but notifications about the IP releases only started appearing in the UI after maybe some 40 seconds. Releasing all 16 IPs I had took about 2 minutes.
What's interesting, after releasing all the IPs I was able to assign IPs again, but even with delays between executions it started failing after 3 assigned IPs, and each enableStaticNat call took longer than the previous one:

{
  "success": true
}

real	0m9.172s
user	0m0.276s
sys	0m0.036s

{
  "success": true
}

real	0m9.744s
user	0m0.298s
sys	0m0.020s

{
  "success": true
}

real	0m10.215s
user	0m0.275s
sys	0m0.032s

then failed with

real	0m10.347s
user	0m0.306s
sys	0m0.032s

@akrasnov-drv
Author

Can anybody suggest Java parameters that might improve this?

@akrasnov-drv
Author

I found this warning in the management log; it seems relevant:

2025-02-03 16:32:50,667 WARN  [o.a.c.m.w.WebhookServiceImpl] (qtp1438988851-11443:[ctx-5bff4776, ctx-109acf86, ctx-c08147b9]) (logid:d6760df6) Skipping delivering event [ID: null, description: {"details":"Ip Id: 117","event":"STATICNAT.ENABLE","status":"Completed"}] to any webhook as account ID is missing
2025-02-03 16:32:50,667 WARN  [o.a.c.f.e.EventDistributorImpl] (qtp1438988851-11443:[ctx-5bff4776, ctx-109acf86, ctx-c08147b9]) (logid:d6760df6) Failed to publish event [category: ActionEvent, type: STATICNAT.ENABLE] on bus webhookEventBus

The part of the log corresponding to the enableStaticNat execution failure is attached.

mgmt.log.gz

@DaanHoogland
Contributor

@akrasnov-drv, do you have webhooks enabled? It seems like the issue is not in enableStaticNat itself.
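
A quick way to check could be the webhook list API (assuming the 4.20 webhook APIs are available in this build; the response key is an assumption as well):

# List configured webhooks, if any (hypothetical check via the 4.20 listWebhooks API)
cloudstack listWebhooks | jq ".webhook[]?.name"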

@weizhouapache
Member

weizhouapache commented Feb 4, 2025

What's interesting, after releasing all IPs I was able to assign IPs again, but even with delays between executions it started failing after 3 assigned IP, each time enableStaticNat took longer:

@akrasnov-drv
have you increased the workers setting and restarted the mgmt server?
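
For reference, the global setting can be inspected and raised through the standard configuration APIs; a minimal sketch (a management server restart is still needed for it to take effect):

# Check the current value of the global 'workers' setting
cloudstack listConfigurations name=workers | jq ".configuration[].value"
# Raise it, then restart the management server
cloudstack updateConfiguration name=workers value=50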

@shwstppr
Contributor

shwstppr commented Feb 4, 2025

@DaanHoogland I don't think webhooks are the cause here. The logs shared are for the API call that finished in around 10s. I see multiple agent-server Command-Answer exchanges each taking a few seconds. Also, multiple errors like:

2025-02-03 16:32:46,464 DEBUG [c.c.a.t.Request] (AgentManager-Handler-58:[]) (logid:) Seq 1-1529253548469028850: Processing:  { Ans: , MgmtId: 108380656826451, via: 1, Ver: v1, Flags: 0, [{"com.cloud.agent.api.routing.GroupAnswer":{"results":["null - success: Creating file in VR, with ip: 169.254.100.174, file: staticnat_rules.json.ec8854ec-ec69-46e8-8192-60cbb8333222","null - success: Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.
Error: Nexthop has invalid gateway.

Maybe the API needs optimization, or there is an underlying network issue.
@akrasnov-drv do you also have logs from when the API calls started failing with timeouts?

@weizhouapache
Member

Nexthop has invalid gateway.

it looks like @akrasnov-drv uses redundant VRs; the error happens on the BACKUP VR, which does not have a default route because its public NIC is DOWN.

mgmt.log:2025-02-03 16:32:46,485 DEBUG [c.c.a.t.Request] (qtp1438988851-11443:[ctx-5bff4776, ctx-109acf86, ctx-c08147b9]) (logid:d6760df6) Seq 9-1265230020314433522: Sending  { Cmd , MgmtId: 108380656826451, via: 9(kvm23), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.routing.VmDataCommand":{"vmIpAddress":"10.10.240.223","vmName":"tiny-test206","executeInSequence":"false","accessDetails":{"router.name":"r-1055-VM","router.guest.ip":"10.10.250.155","router.ip":"169.254.158.18","zone.network.type":"Advanced"},"wait":"0","bypassHostMaintenance":"false"}}] }
mgmt.log:2025-02-03 16:32:48,944 DEBUG [c.c.a.t.Request] (AgentManager-Handler-2:[]) (logid:) Seq 9-1265230020314433522: Processing:  { Ans: , MgmtId: 108380656826451, via: 9, Ver: v1, Flags: 10, [{"com.cloud.agent.api.routing.GroupAnswer":{"results":["null - success: Creating file in VR, with ip: 169.254.158.18, file: vm_metadata.json.aff5bcbb-6aa2-44a8-bd2d-8100c26a1095","null - success: Invalid unit name "[email protected],10.10.250.155" escaped as "[email protected]\x2c10.10.250.155" (maybe you should use systemd-escape?).
mgmt.log:2025-02-03 16:32:48,955 DEBUG [c.c.a.t.Request] (qtp1438988851-11443:[ctx-5bff4776, ctx-109acf86, ctx-c08147b9]) (logid:d6760df6) Seq 1-1529253548469028851: Sending  { Cmd , MgmtId: 108380656826451, via: 1(h239), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.routing.VmDataCommand":{"vmIpAddress":"10.10.240.223","vmName":"tiny-test206","executeInSequence":"false","accessDetails":{"router.name":"r-1056-VM","router.guest.ip":"10.10.248.182","router.ip":"169.254.100.174","zone.network.type":"Advanced"},"wait":"0","bypassHostMaintenance":"false"}}] }
mgmt.log:2025-02-03 16:32:50,667 DEBUG [c.c.a.t.Request] (AgentManager-Handler-4:[]) (logid:) Seq 1-1529253548469028851: Processing:  { Ans: , MgmtId: 108380656826451, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.routing.GroupAnswer":{"results":["null - success: Creating file in VR, with ip: 169.254.100.174, file: vm_metadata.json.f026f90b-9aa8-4e05-ae39-21c2b3113891","null - success: Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
mgmt.log:Error: Nexthop has invalid gateway.
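
If that is the case, it should be visible directly inside the BACKUP router with plain iproute2 (eth2 as the public NIC is an assumption; the interface name can differ per setup):

# Inside the BACKUP VR: with the public NIC down there should be no default route,
# which matches the "Nexthop has invalid gateway" errors above
ip link show eth2
ip route show default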

@akrasnov-drv
Author

First of all thanks for the attention and care.

@DaanHoogland I tried using webhooks in the past, but when I started getting various issues I recreated the cluster without webhooks. Here is my agent config:

#Storage
#Thu Jan 30 15:31:01 UTC 2025
cluster=1
pod=1
resource=com.cloud.hypervisor.kvm.resource.LibvirtComputingResource
private.network.device=private
icluster=1
domr.scripts.dir=scripts/network/domr/kvm
guest.cpu.mode=host-passthrough
guest.network.device=private
keystore.passphrase=XXXXXXXXXXXXX
hypervisor.type=kvm
port=8250
zone=1
public.network.device=public
local.storage.uuid=07db5346-58e2-4a8c-9b6f-08b3b57800a7
host=10.10.67.1@static
guid=1ce2f60a-5455-3f8a-846c-86b9413d2a76
LibvirtComputingResource.id=8
workers=5
iscsi.session.cleanup.enabled=false
vm.migrate.wait=3600

Nevertheless (I believe I reported this before), there is an /etc/libvirt/hooks/qemu hook installed by the cloudstack-agent package:

md5sum /usr/share/cloudstack-agent/lib/libvirtqemuhook /etc/libvirt/hooks/qemu
5eb81c675b4caf8c5cb538e19673f29e  /usr/share/cloudstack-agent/lib/libvirtqemuhook
5eb81c675b4caf8c5cb538e19673f29e  /etc/libvirt/hooks/qemu

and I had some doubts about it, though I do not see how it could be related to the current NAT issue. I created 100 VMs via the API without a problem; only this call fails with a timeout.

@weizhouapache I have workers set to 50 in the global config, but as you can see above, the agent has it set to 5, and that's not something I set. I can increase it, no problem, but I really doubt the number of workers is relevant to a failure in a single API call.
As for VR HA: yes, I currently have a pair, but I believe it also failed with a single VR. I'll change the setup to make sure and provide info on that.

@shwstppr I'll clean the env and start it again to provide a wider log, covering both successful executions and then failure.

In the meantime, has anybody tried my flow? Did it work (in which case the failure is specific to my environment)?

@DaanHoogland
Contributor

@akrasnov-drv enabling hundred(s) of static NAT addresses is not a usual case, but it should work. We'll have to investigate what might go wrong. Do you have a clean environment to experiment in?
If so, let's try to reproduce it in a non-redundant env to see how fast it occurs there.
The next step can be with a redundant VR to see if that is the issue.

I doubt either is the culprit, but we'll have to start simple.

@akrasnov-drv
Author

I removed all VMs and the network, and recreated them with the standard isolated network offering with static NAT and a single VR.
Each call's execution time kept increasing, with the last one almost at the timeout:

Wed Feb  5 04:12:44 PM UTC 2025
{
  "success": true
}

real	0m3.929s
user	0m0.280s
sys	0m0.024s
Wed Feb  5 04:12:48 PM UTC 2025
Wed Feb  5 04:12:50 PM UTC 2025
{
  "success": true
}

real	0m3.842s
user	0m0.274s
sys	0m0.033s
Wed Feb  5 04:12:54 PM UTC 2025
Wed Feb  5 04:12:56 PM UTC 2025
{
  "success": true
}

real	0m3.973s
user	0m0.289s
sys	0m0.016s
Wed Feb  5 04:13:00 PM UTC 2025

...

Wed Feb  5 04:17:54 PM UTC 2025
{
  "success": true
}

real	0m9.809s
user	0m0.285s
sys	0m0.024s
Wed Feb  5 04:18:04 PM UTC 2025
Wed Feb  5 04:18:06 PM UTC 2025
{
  "success": true
}

real	0m10.102s
user	0m0.281s
sys	0m0.032s
Wed Feb  5 04:18:17 PM UTC 2025
Wed Feb  5 04:18:19 PM UTC 2025
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 446, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 441, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
    retries = retries.increment(
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 534, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 700, in urlopen
    httplib_response = self._make_request(
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 448, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 337, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/cloudstack", line 11, in <module>
    load_entry_point('cs==2.7.1', 'console_scripts', 'cs')()
  File "/usr/lib/python3/dist-packages/cs/__init__.py", line 104, in main
    response = getattr(cs, command)(fetch_result=fetch_result,
  File "/usr/lib/python3/dist-packages/cs/client.py", line 213, in handler
    return self._request(command, **kwargs)
  File "/usr/lib/python3/dist-packages/cs/client.py", line 273, in _request
    response = session.send(prepped,
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 657, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8080): Read timed out. (read timeout=10)

real	0m10.337s
user	0m0.293s
sys	0m0.036s
Wed Feb  5 04:18:29 PM UTC 2025

It managed to assign about 40 IPs along the way before failing.
So, there are 2 takeaways from my tests:

  • a redundant VR really doubles (or more than doubles) the time of the call
  • there is some problem where each subsequent call takes more time (it depends on the number of calls/assigned IPs, not on some extra load; if I retry the call after a significant amount of time, it still fails)

Attaching logs from the management server and from the VR covering the period above.

cloud.log.gz

management-server.log.gz

@akrasnov-drv
Author

@DaanHoogland it was not our intention. We are just trying to use a CloudStack fleet in Jenkins. The only plugin supporting CloudStack is the jclouds plugin, and it requires a public IP and static NAT.
I opened a Jira ticket for this, https://issues.jenkins.io/browse/JENKINS-74928, but have not gotten a single reply for months.
So we are trying to use what we have, and in the end we'll need hundreds of nodes, maybe even around a thousand.
Still, that should not be a problem from the iptables side...
Besides, I'm thinking of trying to change the Jenkins plugin, but that's another story.
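
For what it's worth, the number of static NAT rules actually programmed can be sanity-checked inside the VR with plain iptables (a rough check; the exact chain layout may differ between CloudStack versions):

# Inside the VR: count DNAT/SNAT rules currently installed in the nat table
iptables -t nat -S | grep -c DNAT
iptables -t nat -S | grep -c SNAT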

@DaanHoogland
Contributor

Noted @akrasnov-drv, this has been added to the backlog and has to be investigated. I'm afraid I don't have a workaround off the top of my head.
