@abouteiller
Contributor

…sue #138

@abouteiller abouteiller requested a review from a team as a code owner March 12, 2025 13:41
@abouteiller abouteiller self-assigned this Mar 12, 2025
@abouteiller abouteiller added the bug Something isn't working label Mar 12, 2025
@abouteiller abouteiller linked an issue Mar 12, 2025 that may be closed by this pull request
@devreal
Contributor

devreal commented Mar 12, 2025

Is there a way to handle this more gracefully in PaRSEC than a Segfault?

@abouteiller
Contributor Author

abouteiller commented Mar 12, 2025

> Is there a way to handle this more gracefully in PaRSEC than a Segfault?

Yes. The resulting symptom is that we have data.dst_type == NULL and data.dst_count == 1, which is an impossible combination; at a minimum we could assert on it.
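A minimal sketch of such a check, for illustration only. The struct and function names below are hypothetical stand-ins, not PaRSEC's actual types; only the field names dst_type and dst_count come from the comment above. It shows both an assert for debug builds and a variant that reports the problem and returns an error instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the receive-side descriptor; only the two
 * field names (dst_type, dst_count) are taken from the discussion. */
typedef struct {
    const void *dst_type;   /* datatype handle, NULL when it was never set */
    int         dst_count;  /* number of dst_type elements to receive */
} recv_desc_t;

/* Returns 0 when the descriptor is usable, -1 when it is in the
 * inconsistent state: a non-zero count with no datatype. */
static int recv_desc_check(const recv_desc_t *d)
{
    if (d->dst_type == NULL && d->dst_count != 0) {
        fprintf(stderr,
                "receive descriptor has count %d but no datatype; "
                "was the default arena datatype set?\n",
                d->dst_count);
        return -1;
    }
    return 0;
}

/* Debug-build variant: aborts on the inconsistent state.
 * Compiled out when NDEBUG is defined. */
static void recv_desc_assert(const recv_desc_t *d)
{
    assert(!(d->dst_type == NULL && d->dst_count != 0));
    (void)d; /* avoid an unused-parameter warning under NDEBUG */
}
```

The error-returning variant would let the caller fail with a message that points at the missing datatype, which is closer to the "more graceful" handling asked about than a bare assert.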

Contributor

@bosilca bosilca left a comment

How has this ever worked without a proper datatype and arena? The puzzling part is that neither GEMM nor POTRF has a default tile type.

@abouteiller
Contributor Author

In PO and GEMM_SUMMA/GEMM we use adt_all_loc to create and populate the arenas in a wrapped collection. We did not adapt GEMM_NN_GPU to use that same feature, so it must still rely on setting arena_datatypes[DEFAULT_IDX]. We never select GEMM_NN_GPU except in memory-constrained scenarios, and those were broken by separate problems that masked this issue by crashing before we hit it (see ICLDisco/parsec#733).

This was also reported by J John in #67, but we didn't follow up on it because we did not see it ourselves.

@bosilca
Contributor

bosilca commented Mar 12, 2025

The arenas in the data collection are not linked to the arenas in the taskpool, and that datatype is only used for sends, once we have a valid data copy. When we receive data and create the local data copy, which is not attached to any data collection, we extract the arena and the datatype from the taskpool, and right now these are NULL. So how are we receiving anything?

@abouteiller
Contributor Author

(From Slack, for posterity) See dplasmajdf_lapack_dtt.h:61: we don't hit the arena_datatypes in the taskpool in this pathway. This interacts with how we set the type_remote in the JDF (ADT_DC); in that case we read the types from a hashtable.
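A toy model of the two lookup paths, for illustration only. Every type and function name here is hypothetical, the value of DEFAULT_IDX is assumed, and a linear scan stands in for the real hashtable. It is not PaRSEC or DPLASMA code; it only shows why a pathway that reads types from a keyed table can work while the taskpool's array slot is still unset.

```c
#include <stddef.h>
#include <string.h>

#define DEFAULT_IDX 0  /* assumed value, for illustration */
#define NB_SLOTS    4  /* hypothetical number of array slots */

/* Hypothetical datatype handle. */
typedef struct { const char *name; } dtype_t;

/* Path 1: per-taskpool array, indexed by slot. A slot that was never
 * set stays NULL. */
typedef struct { const dtype_t *arena_datatypes[NB_SLOTS]; } taskpool_t;

static const dtype_t *lookup_in_taskpool(const taskpool_t *tp, int idx)
{
    if (idx < 0 || idx >= NB_SLOTS)
        return NULL;
    return tp->arena_datatypes[idx];
}

/* Path 2: keyed table, independent of the taskpool array. */
typedef struct { const char *key; const dtype_t *dtype; } table_entry_t;

static const dtype_t *lookup_in_table(const table_entry_t *tab, size_t n,
                                      const char *key)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].key, key) == 0)
            return tab[i].dtype;
    }
    return NULL;
}
```

In this model, code that only goes through lookup_in_table never notices an empty arena_datatypes[DEFAULT_IDX], while code that goes through lookup_in_taskpool gets NULL until the slot is populated.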

@abouteiller
Contributor Author

We discussed on Slack that it would be good to unify the ADTT_READ hashtable and the arena_datatypes array mechanics, but I don't have time for this at the moment, so that will have to come later.

@abouteiller abouteiller force-pushed the bugfix/gemm_gpu_missing_dtype branch 3 times, most recently from e522b6a to 74ab1a6 Compare March 13, 2025 19:46
@abouteiller abouteiller force-pushed the bugfix/gemm_gpu_missing_dtype branch from 74ab1a6 to 272114e Compare March 13, 2025 19:46
@abouteiller abouteiller requested a review from bosilca March 14, 2025 17:45
@abouteiller abouteiller merged commit 2956a05 into ICLDisco:master Mar 27, 2025
9 checks passed
@abouteiller abouteiller deleted the bugfix/gemm_gpu_missing_dtype branch March 27, 2025 15:29
Development

Successfully merging this pull request may close these issues.

GEMM NN GPU fails even with the under-transfer fix