Skip to content

fix some memleak #13306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

fix some memleak #13306

wants to merge 2 commits into from

Conversation

wjjahah
Copy link

@wjjahah wjjahah commented Jun 13, 2025

avx_component module retain twice by avx_component_op_query and ompi_op_base_op_select function;
mca_base_var_register will strdup, and not free old string;
opal_patcher_base_framework not close;

avx_component module retain twice by avx_component_op_query and ompi_op_base_op_select function;
mca_base_var_register will strdup, and not free old string;
opal_patcher_base_framework not close;

Signed-off-by: wjjahah <[email protected]>
opal_common_ucx open the opal_memory_base_framework but not close;

Signed-off-by: wjjahah <[email protected]>
@@ -123,6 +125,7 @@ OPAL_DECLSPEC void opal_common_ucx_mca_var_register(const mca_base_component_t *
"please set to '^posix,sysv,self,tcp,cma,knem,xpmem'.",
MCA_BASE_VAR_TYPE_STRING, NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE | MCA_BASE_VAR_FLAG_DWG,
OPAL_INFO_LVL_3, MCA_BASE_VAR_SCOPE_LOCAL, opal_common_ucx.tls);
free(old_str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MCA is now correct as it uses a duplicate, but opal_common_ucx.tls is not as it now points to a deallocated string. Why don't you just release these two strings in opal_common_ucx_mca_deregister?

Copy link
Author

@wjjahah wjjahah Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The MCA is now correct as it uses a duplicate, but opal_common_ucx.tls is not as it now points to a deallocated string. Why don't you just release these two strings in opal_common_ucx_mca_deregister?

This function will enter multiple times, mca var list just record last duplicate value, and the last duplicate will free by var_destructor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course this function will be entered multiple times, hopefully as many times as the UCX common infrastructure is used. This function counts the number of times the common part is used, and will release the internals only when all instances are dropped.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When register mca variable ucx_tls and ucx_devices, just record newest duplicate, so memory allocate by last mca rigister(strdup) is leak.

@wjjahah
Copy link
Author

wjjahah commented Jun 17, 2025

Hi @bosilca , this PR is waiting for workflow approval. Could you please approve it when you have a moment?

@jsquyres
Copy link
Member

@bosilca Looks like you queued up the NVIDIA job last night, but it sat in queue for 10+ hours, so I tried to cancel it. But now it seems stuck. Can you check what's happening on the NVIDIA side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants