Memory leak when Pyroscope is enabled #28

Open
nwalters512 opened this issue Feb 21, 2023 · 6 comments

@nwalters512
Contributor

When we enable Pyroscope on our application, we observe a steady increase in memory consumption. I've annotated an AWS CloudWatch graph of Node RSS memory:

[Annotated CloudWatch graph of Node RSS memory]

Some commentary about what we saw:

  • After enabling Pyroscope, we saw memory usage increase pretty much linearly.
  • After a while, the EC2 instance as a whole became completely unresponsive. At that point I couldn't even SSH into the instance to poke around, so I can't even guess at the cause. Rebooting the entire VM "fixed" things (until they broke again).
    • You can identify the "unresponsive" periods as the ones that are missing data.
  • This was on our staging environment, so the app wasn't serving any real requests and was under basically zero load.
  • After disabling Pyroscope and making no other changes, we stopped leaking memory.
  • The graph lines that spike past 3.0GB/1.5GB are for an instance running a single copy of the app. The other 4 lines are for 4 copies of the app running on a single instance. I'm unsure why the two types of hosts behave differently. My best guess is that the OS starts doing a ton of swapping, or runs out of memory entirely (we're on an EC2 t3.medium instance with 4GB of RAM, so 4 processes x 1GB each would definitely exhaust the available RAM).
  • Pyroscope itself doesn't show the Node process using anywhere close to 1GB of memory (per inuse_space, it maxed out at ~350MB).

These lines of code are the only difference between "leaking memory" and "not leaking memory":

          const Pyroscope = require('@pyroscope/nodejs');
          Pyroscope.init({
            appName: 'prairielearn',
            // Assume `config` contains sensible values.
            serverAddress: config.pyroscopeServerAddress,
            authToken: config.pyroscopeAuthToken,
            tags: {
              instanceId: config.instanceId,
              ...(config.pyroscopeTags ?? {}),
            },
          });
          Pyroscope.start();

I recognize this isn't a ton of information to go on, so I'd be happy to provide anything else that might help get to the bottom of this. We'd love to use Pyroscope, but our experience so far is an obvious dealbreaker.

@Rperry2174
Contributor

Thanks for reporting @nwalters512 and sorry for the inconvenience. We'll take a look and see if we have more questions / what it will take to fix this. cc @eh-am @petethepig

@nwalters512
Contributor Author

Thanks @Rperry2174! In case it's useful, I've discovered that heap total/used memory remains constant even as the RSS memory grows seemingly without bounds. From some reading elsewhere (e.g. nodejs/help#1518), it seems as though that could indicate a leak in native code, perhaps in https://github.com/google/pprof-nodejs? Alternatively, it may be that there's not enough memory pressure for the system to be reclaiming this memory?
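
For anyone following along, here's a minimal sketch of how these numbers can be sampled from inside the process (illustrative only; our actual metrics come from CloudWatch):

          // Minimal sketch (illustrative, not our exact instrumentation):
          // periodically log Node memory stats so RSS can be compared
          // against V8 heap usage over time.
          setInterval(() => {
            const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
            const mb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
            console.log(
              `rss=${mb(rss)}MB heapTotal=${mb(heapTotal)}MB ` +
                `heapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`
            );
          }, 60 * 1000);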

Here are the memory metrics for a single process on this host (I enabled Pyroscope only on that one process; the rest don't have Pyroscope enabled and don't see constant RSS growth):

[Screenshot: memory metrics for the single Pyroscope-enabled process, 2023-02-22 11:46]

@nwalters512
Contributor Author

I also managed to capture a heap snapshot when the RSS was at ~700MB. Unsurprisingly, the heap snapshot only shows ~110MB of memory allocations, which is consistent with the NodeMemoryHeapTotal and NodeMemoryHeapUsed metrics at the time.
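
For reference, a snapshot like this can be captured in-process with Node's built-in v8 module (a sketch, not necessarily how this particular snapshot was taken):

          // Sketch: capture a heap snapshot from inside the running process.
          // The resulting .heapsnapshot file can be opened in Chrome DevTools.
          const v8 = require('v8');
          const snapshotPath = v8.writeHeapSnapshot();
          console.log(`Heap snapshot written to ${snapshotPath}`);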

@korniltsev
Collaborator

@nwalters512 could you let me know your Node version, as well as the architecture and the OS / base Docker image you're using? I'm curious whether the issue only happens on EC2 or locally as well.

It would be really helpful if we could reproduce the issue with a docker container, but I'm not sure how difficult that would be to set up.

While I'm trying to reproduce the issue locally, maybe we could start by running CPU and memory profiling exclusively on your staging environment? Instead of using Pyroscope.start(), we could try using startCpuProfiling() and startHeapProfiling() to see if it's a CPU or memory issue or both. That might help us narrow down our focus for further investigation.
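
Roughly something like this (a sketch, assuming startCpuProfiling() and startHeapProfiling() are exported from @pyroscope/nodejs alongside init() and start()):

          const Pyroscope = require('@pyroscope/nodejs');
          // Same init options as in the original report.
          Pyroscope.init({
            appName: 'prairielearn',
            serverAddress: config.pyroscopeServerAddress,
            authToken: config.pyroscopeAuthToken,
          });

          // Enable only one profiler per run to isolate which one leaks.
          Pyroscope.startCpuProfiling();
          // ...then, in a separate run, swap in:
          // Pyroscope.startHeapProfiling();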

@nwalters512
Contributor Author

@korniltsev this is Node 16.16.0, x86_64, Amazon Linux 2 running directly on the host (not inside Docker). Unfortunately I was unable to reproduce this locally, but it does happen very consistently across multiple EC2 hosts.

Good idea on trying to narrow this down to CPU vs. heap profiling! Let me give that a shot and report back with any findings.

@nwalters512
Contributor Author

@korniltsev it does look like this is limited to CPU profiling.

[Screenshot: RSS over time, 2023-02-23 09:52]

At around 17:13, I updated the code to only call startCpuProfiling() and restarted the process; you can see RSS starts growing immediately. At around 17:28, I changed it to only call startHeapProfiling(), and RSS has been stable since then.
