Skip to content

Conversation

@devreal
Copy link
Contributor

@devreal devreal commented Feb 24, 2025

We must not call transfer_ownership with a WRITE flow under transfer. We cannot distinguish between stage-in and eviction so we just bail out and come back later.

The assert that otherwise triggers is here: https://github.com/ICLDisco/parsec/blob/master/parsec/data.c#L420

We must not call transfer_ownership with a WRITE flow under transfer.
We cannot distinguish between stage-in and eviction so we just bail
out and come back later.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal devreal requested a review from a team as a code owner February 24, 2025 16:59
@devreal
Copy link
Contributor Author

devreal commented Feb 24, 2025

Alternative: reset the transfer flag and set the host version to 0. The epilogue of the w2r task will ignore the host copy if it detects that the device version has changed (https://github.com/ICLDisco/parsec/blob/master/parsec/mca/device/transfer_gpu.c#L319)

@abouteiller abouteiller added the bug Something isn't working label Mar 7, 2025
@abouteiller abouteiller added this to the v4.1 milestone Mar 7, 2025
@abouteiller abouteiller self-requested a review March 7, 2025 17:46
@abouteiller abouteiller linked an issue Mar 7, 2025 that may be closed by this pull request
5 tasks
@abouteiller
Copy link
Contributor

abouteiller commented Mar 7, 2025

Interestingly that fixes the assert for POTRF with the lowmem parameters (ctest), but not for GEMM lowmem_mpi. Everything else works, so overall it is improving the current state, but I had expected and hoped it would fix both.

@abouteiller

This comment was marked as resolved.

@abouteiller

This comment was marked as resolved.

@abouteiller

This comment was marked as outdated.

Copy link
Contributor

@abouteiller abouteiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall improvement over master, case that is still deadlocking #741 also deadlock in master

Pass all other ctests and large-size PO/GEMM

@bosilca bosilca merged commit d9b1263 into ICLDisco:master Mar 13, 2025
12 checks passed
@abouteiller abouteiller mentioned this pull request Mar 14, 2025
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Check dplasma ctest pass

3 participants