Socket handle leak on Linux VMs #684
Comments
If anyone has experience with this issue on Linux VMs, or if you have any insight as to possible causes, I would appreciate the feedback. I am able to do some limited validation of VMs on the squeaksource.com server, but I need to be very careful to avoid impacting users of that service, so suggestions or advice are welcome here.
I have been building VMs from different points in the commit history, and testing them on squeaksource.com for the socket descriptor leak. I can now confirm that the problem is associated with (not necessarily caused by) the introduction of Linux EPOLL support in aio.c in October 2020: commit 171c235. VMs built at this commit and later (merged at 5fea0e3), including current VMs, have the socket handle leak problem. VMs built from commits up through the immediately preceding commit (da7954d) do not have the socket leak. I was also able to build and test a current VM with the EPOLL logic disabled (#define HAVE_EPOLL 0, #define HAVE_EPOLL_PWAIT 0). This VM does not have the handle leak problem.
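For readers who have not worked with epoll: below is a minimal, generic sketch of the register/wait/deregister pattern that epoll-based multiplexing follows. This is purely illustrative and is not the actual aio.c code. The relevant property for this issue is that epoll only forgets a descriptor once every reference to it has been closed, and any descriptor that stays open counts against the per-process fd limit whether or not it is still being polled.

```c
/* Generic epoll usage sketch (illustrative only, not the actual aio.c code):
 * descriptors are registered once and readiness is collected in one wait call. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/epoll.h>

int main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) { perror("epoll_create1"); return EXIT_FAILURE; }

    /* Register stdin for readability; aio.c would register socket fds here. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = STDIN_FILENO };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, STDIN_FILENO, &ev) < 0)
        perror("epoll_ctl ADD");

    struct epoll_event ready[8];
    int n = epoll_wait(epfd, ready, 8, 1000 /* ms */);
    printf("%d descriptor(s) ready\n", n);

    /* The descriptor must eventually be deregistered and closed by its owner;
     * epoll drops it automatically only once every reference is closed. */
    epoll_ctl(epfd, EPOLL_CTL_DEL, STDIN_FILENO, NULL);
    close(epfd);
    return EXIT_SUCCESS;
}
```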
It is quite clear that the socket handle leak is associated with (not necessarily caused by) the use of Linux EPOLL in aio.c. On our squeaksource.com server, the issue can be reproduced within an hour of runtime, simply by running the Squeak image on a VM with EPOLL in effect. However, if I run a copy of the same Squeak image on my local PC, connecting to the SqueakSource image from a web browser on my local network, I am unable to reproduce the problem. A possible difference is that the production squeaksource.com server runs behind NGINX port forwarding, so there may be differences in the way the TCP sessions are handled (and closed) in that configuration.
@dtlewis290 Maybe
A colleague asked ChatGPT for some input; this may be useful. “Yes, I'm familiar with how …”
David, can you somehow test with HTTP 1.0 and/or w/o keep-alive? We could change the nginx config, too, but let's test first…
Hmm, I don't think that I know how to perform such a test, but also I am not really able to correlate the leaked socket handles to any specific client activity. The squeaksource.com image is serving requests from Squeak clients, web scraping robots, and me all at the same time. So if a TCP session issue leads to an unclosed handle in the VM, I do not really know how to figure out where it came from. All that I can say for sure is that after running for about an hour, there will be an accumulation of socket handles in the unix process for the VM.
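As a side note, here is a small sketch of how that accumulation could be sampled from outside the image, assuming the standard Linux /proc layout and taking the VM's pid as an argument. This is illustrative tooling, not something that exists in the project:

```c
/* Count descriptors open in a target process by scanning /proc/<pid>/fd
 * and checking which entries link to "socket:[...]". Sketch only; the pid
 * of the VM process is passed as the single argument. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char dirPath[64];
    snprintf(dirPath, sizeof dirPath, "/proc/%s/fd", argv[1]);

    DIR *dir = opendir(dirPath);
    if (!dir) { perror(dirPath); return 1; }

    int total = 0, sockets = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;                       /* skip "." and ".." */
        total++;

        char linkPath[128], target[128];
        snprintf(linkPath, sizeof linkPath, "%s/%s", dirPath, entry->d_name);
        ssize_t n = readlink(linkPath, target, sizeof target - 1);
        if (n > 0) {
            target[n] = '\0';
            if (strncmp(target, "socket:", 7) == 0)
                sockets++;                  /* e.g. "socket:[123456]" */
        }
    }
    closedir(dir);

    printf("open fds: %d, of which sockets: %d\n", total, sockets);
    return 0;
}
```

Run periodically (for example under watch) this would give a rough time series of total and socket descriptors for the VM process without touching the image.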
So where I'm coming from is:
you could use curl --http1.0 to test. Here is what could lead to that:
I don't know which HTTP server the image uses. What I'm saying is: the layer above TCP could be a reason for lingering sockets…
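To make the CLOSE_WAIT angle concrete: a socket sits in CLOSE_WAIT when the remote end has already closed its half of the connection and the local process has not yet called close() on its own descriptor, so CLOSE_WAIT sockets piling up on the server point at closes that never happen on the image/VM side. Purely as an illustration of the kind of short-lived, non-keep-alive request being discussed, a curl --http1.0-style client in C might look like the sketch below; host, port, and path are assumptions matching the local test mentioned later in this thread.

```c
/* Minimal HTTP/1.0-style client, roughly what `curl --http1.0` does:
 * connect, send one request without keep-alive, read the reply, close.
 * Host, port, and path are assumptions (localhost:8888/OSProcess/). */
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM }, *res;
    if (getaddrinfo("localhost", "8888", &hints, &res) != 0) {
        fprintf(stderr, "getaddrinfo failed\n");
        return 1;
    }

    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        perror("connect");
        return 1;
    }

    const char *req =
        "GET /OSProcess/ HTTP/1.0\r\n"
        "Host: localhost\r\n"
        "Connection: close\r\n"
        "\r\n";
    write(fd, req, strlen(req));

    char buf[4096];
    while (read(fd, buf, sizeof buf) > 0)
        ;                                   /* drain the response */

    close(fd);                              /* client side closes its end */
    freeaddrinfo(res);
    return 0;
}
```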
Thanks for the tip on curl usage. To check my understanding, I should try running a squeaksource image locally on my PC with no nginx, and then make lots of connection requests with curl --http1.0 to mimic the kind of connections that would come from nginx. I should then look for an accumulation of socket handles for my VM process. Does that sound right? I have run some initial testing with the image serving on port 8888, and with connections being made with: $ watch -n 1 curl --http1.0 http://localhost:8888/OSProcess/ So far I see no socket handle leaks but I will give it some time and see if I can find anything. The image is using Kom from https://source.squeak.org/ss/KomHttpServer-cmm.10.mcz. |
@krono my local testing is inconclusive. I am unable to reproduce the handle leak on my laptop PC using either --http1.0 or --http1.1 so I am not able to say if this is a factor. The handle leak is very repeatable when running squeaksource.com on dan.box.squeak.org but I have not found a way to reproduce it on my local PC. |
@dtlewis290 Well, then the difference I see is that nginx and SqueakSource communicate from within an
As you may know, many network interfaces will kill idle connections after 5 mins. If you don't see a leak on a self-contained local machine but you do see it when going through nginx, this could be related. Also, if this is easily reproducible, it should be easy enough to log socket activity to determine which socket handles are not being closed. The downside is that the socket plugin isn't a clean wrapper around the socket library, so much of this, if not all, must be logged from the plugin and not from the image.
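A hedged sketch of that kind of plugin-side logging follows; the helper name and the call sites in the comment are hypothetical, not actual SocketPlugin code. The idea is simply that every place the plugin acquires or releases an OS-level descriptor writes one line to stderr, and the "open" and "close" streams are compared after an hour of traffic.

```c
/* Hypothetical tracing helper for the plugin/VM side. The function name and
 * the call sites shown below are illustrative, not actual SocketPlugin code.
 * Each acquire/release of an OS descriptor logs one line, so fds that appear
 * as "open" but never as "close" identify the leak. */
#include <stdio.h>
#include <time.h>

static void logSocketFd(const char *event, int fd)
{
    /* timestamped, one line per event; cheap enough for an hour-long run */
    fprintf(stderr, "%ld socket-trace %s fd=%d\n", (long)time(NULL), event, fd);
}

/* Intended usage at the plugin's descriptor lifecycle points, e.g.:
 *
 *     int fd = accept(listenFd, NULL, NULL);
 *     logSocketFd("open", fd);
 *     ...
 *     logSocketFd("close", fd);
 *     close(fd);
 *
 * Afterwards, compare the "open" and "close" lines in the captured stderr. */
```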
Hi Eliot, hi jraiford1, hi everybody -- Please refrain from posting unverified ChatGPT answers. Every interested person may ask ChatGPT personally for advice to keep pondering the issue at hand. However, advertising potential hallucinations to everybody might even impede the overall progress on this matter. A disclaimer like "Of course take this with a grain of salt as it could be a complete hallucination :)" does not make this better. Overall, posting unverified ChatGPT answers is no better than guessing (instead of testing hypotheses) or derailing a discussion with personal opinions. Instead, do some research, write some code, share tested solutions -- put some effort into it. If ChatGPT helps you remember a fact you already know, that's okay. Feel free to use such tools when formulating your answers here. 👍 Please be careful. Thank you.
Open socket handles accumulate in /proc/<pid>/fd for an image running an active SqueakSource server. Open handles accumulate gradually, eventually leading to image lockup when the Linux per-process 1024 handle limit is reached. /usr/bin/ss shows an accumulation of sockets in CLOSE_WAIT status, fewer than the handles in the /proc/<pid>/fd list, but presumably associated with TCP sessions for sockets not properly closed from the VM.
Issue observed in a 5.0-202312181441 VM, and is not present in a 5.0-202004301740 VM. Other Linux VMs later than 5.0-202312181441 are likely affected, although this has not been confirmed. See also discussions on the box-admins Slack channel.
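For completeness, here is a small sketch of counting CLOSE_WAIT sockets directly from /proc/net/tcp, which is the same data /usr/bin/ss reads. It counts system-wide rather than per process and is only meant as an illustration of how the second observation above could be tracked over time.

```c
/* Count sockets in CLOSE_WAIT by reading /proc/net/tcp (and tcp6).
 * CLOSE_WAIT is state 0x08 in the kernel's TCP state table.
 * Sketch only: counts system-wide, not per process. */
#include <stdio.h>

static int countCloseWait(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return 0;                           /* e.g. no IPv6 table */

    char line[512];
    int count = 0;
    if (fgets(line, sizeof line, f) != NULL) {      /* skip header line */
        while (fgets(line, sizeof line, f) != NULL) {
            unsigned state;
            /* fields: sl local_address rem_address st ... (st is hex) */
            if (sscanf(line, "%*d: %*s %*s %x", &state) == 1 && state == 0x08)
                count++;
        }
    }
    fclose(f);
    return count;
}

int main(void)
{
    int n = countCloseWait("/proc/net/tcp") + countCloseWait("/proc/net/tcp6");
    printf("sockets in CLOSE_WAIT: %d\n", n);
    return 0;
}
```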