Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts (#129)
```python
GRES_GPU_INDICES: str | None = field(
    default=None,
    metadata={
        "parser": parse_gres_gpu_indices,
        "field_name": "GRES_DETAIL",
        "slurm_field": True,
    },
)
```
can we move to parsed_field here?
parsed_field doesn't support a default parameter — it delegates to field(metadata=...) without passing default. Since GRES_GPU_INDICES must default to None (it's only populated via the REST API path, not squeue CLI), I used field(default=None, metadata=...) directly. This preserves all the same metadata (parser, field_name, slurm_field) that parsed_field would set.
```python
if len(indices) >= 8 or not indices:
    return None
```
can we adapt this to work for nodes with > 8 GPUS?
Done. Removed the hardcoded len(indices) >= 8 threshold from the parser — it now always returns the parsed indices regardless of count. The filtering decision is moved to the PHP caller (FAIRJobAnalyzerPerfAnalyzerModule.php), which compares the number of GPU indices against total GPUs on the node. Added test cases for 10-GPU and 16-GPU nodes to verify.
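The threshold-free decision can be sketched in Python (the PR implements it in Hack in FAIRJobAnalyzerPerfAnalyzerModule.php; the function name and signature here are illustrative assumptions):

```python
def should_filter_gpu_charts(indices: list[int], total_gpus_on_node: int) -> bool:
    """Filter charts to specific GPUs only when the job holds a strict
    subset of the node's GPUs; empty or full allocations show all GPUs.

    Works for any node size (8, 10, 16, ... GPUs) because the comparison
    is against the node's actual GPU count, not a hardcoded 8.
    """
    return 0 < len(indices) < total_gpus_on_node


assert should_filter_gpu_charts([7], 8)                   # partial: filter
assert not should_filter_gpu_charts(list(range(16)), 16)  # full 16-GPU node
assert should_filter_gpu_charts(list(range(4)), 10)       # 10-GPU node, partial
assert not should_filter_gpu_charts([], 8)                # no indices: show all
```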
luccabb left a comment:
since squeue CLI has no GRES_DETAIL this field would always be null, we'd need to add support for Slurm REST (SlurmRestClient) in the collection pipeline
@claude did we miss anything?
Claude finished @luccabb's task in 1m 24s. Review: Did we miss anything?
Beyond the items you already flagged, here are additional issues I found: 1.
Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts (facebookresearch#129)

Summary: Adds GPU index collection from the SLURM REST API's gres_detail field to the GCM pipeline, and uses it in the FAIR Job Analyzer to show only the GPUs assigned to a job (instead of all 8 GPUs on the node).

## Background

When a job uses fewer GPUs than available on a node (e.g., `--gpus-per-task=1` on an 8-GPU node), the Job Analyzer previously showed metrics for all 8 GPUs. The existing GPUS_REQUESTED field comes from TRES-PER-NODE (always 8 for the full node), not the per-task allocation. TRES_GPUS_ALLOCATED correctly reports the count (e.g., 1) but not which specific GPU indices are assigned.

The SLURM REST API provides gres_detail, an array of strings with exact GPU index assignments per node (e.g., `"gpu:ampere:1(IDX:7)"`). Verified on AVA RSC: `scontrol show job <id> -d | grep GRES` → `GRES=gpu:ampere:1(IDX:7)`

## Pipeline change (Python)

- parsing.py: Added `parse_gres_gpu_indices()`, which parses gres_detail strings into GPU index lists. Returns a comma-separated string of indices for single-node partial-GPU jobs (e.g., `"7"` or `"0,3,5"`), and None for full-node (8 GPUs) or multi-node jobs. This avoids storing unnecessary data.
- squeue.py: Added a GRES_GPU_INDICES field (nullable, defaults to None) and a `"gres_detail"` → `"GRES_DETAIL"` REST API mapping. Adds one string column to existing fair_job_data rows; no extra entries.
- test_parsers.py: Added 12 test cases covering single GPU, multiple GPUs, range notation, full-node, multi-node, and edge cases (empty, null, N/A).

## Job Analyzer change (Hack)

- FairJob.php: Added a `$gresGpuIndices` property.
- FAIRJobAnalyzerLatestJobInfoModule.php: Queries GRES_GPU_INDICES from the fair_job_data Scuba table.
- FAIRJobAnalyzerPerfAnalyzerModule.php: When gresGpuIndices is available (e.g., `"7"`), filters all 5 GPU ODS charts (utilization, temperature, SM util, SM occupancy, memory) to `gpu=(7)` with a per-GPU reduceTerm. When null (full-node or multi-node), shows all GPUs with the original averaged reduceTerm.

### Scope

- Single-node partial-GPU jobs: shows only the assigned GPUs (100% accurate)
- Single-node full-GPU jobs: shows all GPUs (unchanged, no filtering needed)
- Multi-node jobs: shows all GPUs (unchanged; gres_detail has per-node values that can't be stored in fair_job_data's 1-row-per-job format)

Differential Revision: D99787988
Force-pushed a744964 to 828670e.
@lushengt-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99787988.
Force-pushed 828670e to 7eafc8d.
Correct: GRES_DETAIL is a REST API-only field and is already mapped via "gres_detail": "GRES_DETAIL" in REST_TO_SQUEUE_FIELD_MAP. The _map_job_fields method in SlurmRestClient populates it from the REST API response. For CLI-based collection, slurm_field is set to False so GRES_DETAIL is excluded from JOB_DATA_SLURM_FIELDS and won't be injected into the squeue format spec. The field defaults to None when not populated.
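A hedged sketch of the REST-to-squeue mapping flow just described: the name `REST_TO_SQUEUE_FIELD_MAP` comes from the comment above, but the dictionary contents (beyond the `gres_detail` entry) and the standalone `map_job_fields` function body are illustrative assumptions, not the actual `SlurmRestClient._map_job_fields` implementation:

```python
# Only fields present in the map are lifted out of the REST response;
# "job_id" here is a hypothetical second entry for illustration.
REST_TO_SQUEUE_FIELD_MAP = {
    "gres_detail": "GRES_DETAIL",
    "job_id": "JOBID",
}


def map_job_fields(rest_job: dict) -> dict:
    """Translate a REST API job record into squeue-style field names.

    Fields absent from the REST response are simply omitted, so a
    CLI-only collection path never sees GRES_DETAIL and the dataclass
    default of None applies.
    """
    return {
        squeue_name: rest_job[rest_name]
        for rest_name, squeue_name in REST_TO_SQUEUE_FIELD_MAP.items()
        if rest_name in rest_job
    }


assert map_job_fields({"job_id": 42, "gres_detail": ["gpu:ampere:1(IDX:7)"]}) == {
    "JOBID": 42,
    "GRES_DETAIL": ["gpu:ampere:1(IDX:7)"],
}
assert map_job_fields({"job_id": 7}) == {"JOBID": 7}  # no gres_detail: omitted
```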
Addressed the issues reported by Claude.
Force-pushed 7eafc8d to 6b7e07f.
Force-pushed 6b7e07f to f2bafdd.
Force-pushed f2bafdd to da069ae.
Force-pushed da069ae to 82c9f41.