-
Notifications
You must be signed in to change notification settings - Fork 903
fix some memleak #13306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix some memleak #13306
Conversation
avx_component module retain twice by avx_component_op_query and ompi_op_base_op_select function; mca_base_var_register will strdup, and not free old string; opal_patcher_base_framework not close; Signed-off-by: wjjahah <[email protected]>
opal_common_ucx open the opal_memory_base_framework but not close; Signed-off-by: wjjahah <[email protected]>
@@ -123,6 +125,7 @@ OPAL_DECLSPEC void opal_common_ucx_mca_var_register(const mca_base_component_t * | |||
"please set to '^posix,sysv,self,tcp,cma,knem,xpmem'.", | |||
MCA_BASE_VAR_TYPE_STRING, NULL, 0, MCA_BASE_VAR_FLAG_SETTABLE | MCA_BASE_VAR_FLAG_DWG, | |||
OPAL_INFO_LVL_3, MCA_BASE_VAR_SCOPE_LOCAL, opal_common_ucx.tls); | |||
free(old_str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MCA is now correct as it uses a duplicate, but opal_common_ucx.tls
is not as it now points to a deallocated string. Why don't you just release these two strings in opal_common_ucx_mca_deregister
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The MCA is now correct as it uses a duplicate, but
opal_common_ucx.tls
is not as it now points to a deallocated string. Why don't you just release these two strings inopal_common_ucx_mca_deregister
?
This function will enter multiple times, mca var list just record last duplicate value, and the last duplicate will free by var_destructor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course this function will be entered multiple times, hopefully as many times as the UCX common infrastructure is used. This function counts the number of times the common part is used, and will release the internals only when all instances are dropped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When register mca variable ucx_tls and ucx_devices, just record newest duplicate, so memory allocate by last mca rigister(strdup) is leak.
Hi @bosilca , this PR is waiting for workflow approval. Could you please approve it when you have a moment? |
@bosilca Looks like you queued up the NVIDIA job last night, but it sat in queue for 10+ hours, so I tried to cancel it. But now it seems stuck. Can you check what's happening on the NVIDIA side? |
avx_component module retain twice by avx_component_op_query and ompi_op_base_op_select function;
mca_base_var_register will strdup, and not free old string;
opal_patcher_base_framework not close;