Memory leak from new rename feature #95

Closed
jeblair opened this issue Mar 2, 2025 · 6 comments · Fixed by #97

Labels: bug (Something isn't working)

@jeblair commented Mar 2, 2025

The new rename feature in #89 appears to cause a memory leak.

It looks like every sample we receive is appended to a list here:

metric_refs[original_topic].append((prom_metric_id, labels))

And it is appended again a few lines below.

One solution to this would be to change the metric_refs data structure to avoid duplication.

But would it be easier to use .clear() to clear all the labelsets from the metric instead of our needing to remember them all?

https://github.com/prometheus/client_python/blob/92b23970f032cbc990aa0e501708c425708e51ea/prometheus_client/metrics.py#L206

Then we can just rely on the prometheus client knowing what needs to be removed instead of keeping track in the exporter.
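
For illustration, a rough sketch of what I mean (made-up metric and label names, not the exporter's actual code):

from prometheus_client import Gauge

# A labeled gauge standing in for one of the exporter's metrics.
temperature = Gauge("mqtt_temperature", "Last temperature seen per topic", ["topic", "sensor"])

temperature.labels(topic="zigbee2mqtt/office", sensor="th01").set(21.5)
temperature.labels(topic="zigbee2mqtt/kitchen", sensor="th02").set(19.0)

# On a rename event, drop every labelset at once; the next incoming
# samples re-create only the labelsets that are still in use.
temperature.clear()

The trade-off would be that labelsets for devices that were not renamed also disappear until their next sample arrives.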

jeblair added the bug label Mar 2, 2025
@kpetremann (Owner) commented Mar 2, 2025

Hi @jeblair,

Thanks for reporting this.

However, I may be wrong, but I am not sure this qualifies as a memory leak.
The exporter keeps metrics in memory on purpose, as the goal is to expose the last known value for sensors or any other IoT device.

Also, unless I am wrong, the issue you are reporting already existed before this PR, as the exporter keeps metrics in the Prometheus registry indefinitely. This PR just makes it worse, am I right?

May I ask:

  • What is your usage of this exporter?
  • How big is the memory usage on your side? On my side it is about 2 MB and it does not grow (unless I add a new IoT sensor to my network).
  • Do you confirm you had no issues in the past, or are you a new user of this exporter?

But would it be easier to use .clear() to clear all the labelsets from the metric instead of our needing to remember them all?

The metrics are never removed, unless there is a device renaming event in zigbee2mqtt (example here: https://github.com/kpetremann/mqtt-exporter/pull/89/files#diff-6457339f71452db172c35bd245b87a4593d9a6a59cad2f59f4d4131a3c470f05R431-R435)

One way to "fix" this would be to clear all metrics upon collection, but it could break many use cases, like sensors with a low rate of change/reporting. Graphs would become "dots" instead of lines, and alerts would break in the same way.
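
For illustration (made-up metric and label names, not the actual code from #89), a sketch of what targeted removal on a rename event looks like with prometheus_client:

from prometheus_client import Gauge

temperature = Gauge("mqtt_temperature", "Last temperature seen per topic", ["topic", "sensor"])
temperature.labels(topic="zigbee2mqtt/old_name", sensor="th01").set(21.5)
temperature.labels(topic="zigbee2mqtt/kitchen", sensor="th02").set(19.0)

# Labelsets remembered by the exporter, keyed by the original topic.
metric_refs = {
    "zigbee2mqtt/old_name": [(temperature, {"topic": "zigbee2mqtt/old_name", "sensor": "th01"})],
}

def handle_rename(old_topic):
    # Remove only the labelsets tied to the renamed device; everything else
    # stays registered, so graphs and alerts for other sensors keep working.
    for metric, labels in metric_refs.pop(old_topic, []):
        metric.remove(*labels.values())

handle_rename("zigbee2mqtt/old_name")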

Thanks!

@jeblair (Author) commented Mar 2, 2025

Sorry if I wasn't clear. I do understand the normal memory usage of both the prometheus client library and the mqtt exporter, in terms of what they need to keep in memory in order to serve the scrape endpoint.

What is your usage of this exporter?

Home monitoring (zigbee2mqtt and esphome sensors).

How big is the memory usage on your side? On my side it is about 2 MB and it does not grow (unless I add a new IoT sensor to my network).

It grew to 500MB before the oomkiller got it. I'm not sure what the previous steady-state size was.

Do you confirm you had no issues in the past, or are you a new user of this exporter?

Yes, I've been using older versions without a problem for about a year (thanks for the program, btw!). I only noticed swapping and the oomkiller on a recent upgrade to 1.6.1. I just reverted to 1.5.0; not enough time has passed to confirm the problem isn't present (it takes about 24 hours to become a problem).

The specific memory leak I'm referring to is that after #89, the exporter appends a tuple of (PromMetricId, labels) for every sample that it receives. To be clear, the issue is not that we're storing information for every metric, it's that we're appending this tuple to a list for every value.

In order to see the leak in action, you may need a fairly busy message bus. I have sensors that emit several values every second, which means the append in line 154 runs every second for all of those values.

I think in #89 the intent was to store all of the labels for all of the metrics, so they could be removed on rename. I think the bug is that we're storing a copy of that every time we receive a value, instead of just storing unique metric+label pairs.
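
A rough sketch of the dedup idea (hypothetical names, not a patch): keep each unique metric+labelset pair once, instead of appending one entry per received sample.

from collections import defaultdict

# topic -> set of unique (prom_metric_id, frozen labels) pairs
metric_refs = defaultdict(set)

def remember_labels(original_topic, prom_metric_id, labels):
    # frozenset(labels.items()) makes the labelset hashable, so repeated
    # samples for the same metric+labelset collapse into a single entry
    # rather than growing the structure by one tuple per value received.
    metric_refs[original_topic].add((prom_metric_id, frozenset(labels.items())))

With something like that, memory grows with the number of distinct sensors and labelsets rather than with message volume.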

@kpetremann (Owner) commented Mar 2, 2025

Many thanks for all the info.

thanks for the program, btw!

You are welcome :)

the exporter appends a tuple of (PromMetricId, labels) for every sample that it receives

Indeed, it is a good candidate for the cause.

It grew to 500MB before the oomkiller got it

Oh my, ok this is definitely a memory leak issue :/ sorry about that!
Would it be possible for you to test with this branch: https://github.com/kpetremann/mqtt-exporter/tree/fix_ref_mem_leak

I'll try to free up more time if it does not fix the issue.

@jeblair (Author) commented Mar 2, 2025

Oh my, ok this is definitely a memory leak issue :/ sorry about that! Would it be possible for you to test with this branch: https://github.com/kpetremann/mqtt-exporter/tree/fix_ref_mem_leak

I'll try to free up more time if it does not fix the issue.

Sure thing, running it now, thanks. I'll report back tomorrow.

ps output at start:

   PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
888275  1.7  2.3  28084 22692 ?        Ssl  20:16   0:01 python /opt/mqtt-exporter/exporter.py

@jeblair (Author) commented Mar 4, 2025

Sorry for the late reply, but 2 days later, it's looking good:

   PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
888275  0.9  1.6  30328 16136 ?        Ssl  Mar02  29:06 python /opt/mqtt-exporter/exporter.py

@kpetremann (Owner)

This is very good news :)
I'll merge the code to master then.
