
Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts (#129)

Open
lushengt-meta wants to merge 1 commit into facebookresearch:main from lushengt-meta:export-D99787988

Conversation


@lushengt-meta lushengt-meta commented Apr 9, 2026

Summary:

Adds GPU index collection from the SLURM REST API's gres_detail field to the GCM pipeline, and uses it in the FAIR Job Analyzer to show only the GPUs assigned to a job (instead of all 8 GPUs on the node).

Background:

When a job uses fewer GPUs than are available on a node (e.g., --gpus-per-task=1 on an 8-GPU node), the Job Analyzer previously showed metrics for all 8 GPUs. The existing GPUS_REQUESTED field comes from TRES-PER-NODE (always 8 for the full node), not the per-task allocation. TRES_GPUS_ALLOCATED correctly reports the count (e.g., 1) but not which specific GPU indices are assigned.

The SLURM REST API provides gres_detail — an array of strings with exact GPU index assignments per node (e.g., "gpu:ampere:1(IDX:7)"). Verified on AVA RSC: scontrol show job <id> -d | grep GRES → GRES=gpu:ampere:1(IDX:7)

Pipeline change (Python):

  • parsing.py: Added parse_gres_gpu_indices() that parses gres_detail strings into GPU index lists. Returns a comma-separated string of indices for single-node partial-GPU jobs (e.g., "7" or "0,3,5"), None for full-node (8 GPUs) or multi-node jobs. This avoids storing unnecessary data.
  • squeue.py: Added GRES_GPU_INDICES field (nullable, defaults to None) and "gres_detail" → "GRES_DETAIL" REST API mapping. Adds one string column to existing fair_job_data rows — no extra entries.
  • test_parsers.py: Added 12 test cases covering single GPU, multiple GPUs, range notation, full-node, multi-node, and edge cases (empty, null, N/A).

Job Analyzer change (Hack):

  • FairJob.php: Added $gresGpuIndices property
  • FAIRJobAnalyzerLatestJobInfoModule.php: Queries GRES_GPU_INDICES from fair_job_data Scuba table
  • FAIRJobAnalyzerPerfAnalyzerModule.php: When gresGpuIndices is available (e.g., "7"), filters all 5 GPU ODS charts (utilization, temperature, SM util, SM occupancy, memory) to gpu=(7) with per-GPU reduceTerm. When null (full-node or multi-node), shows all GPUs with the original averaged reduceTerm.

Scope:

  • Single-node partial-GPU jobs: shows only the assigned GPUs (100% accurate)
  • Single-node full-GPU jobs: shows all GPUs (unchanged, no filtering needed)
  • Multi-node jobs: shows all GPUs (unchanged — gres_detail has per-node values that can't be stored in fair_job_data's 1-row-per-job format)

Differential Revision: D99787988

@github-actions

github-actions bot commented Apr 9, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

Workflow | What it runs
GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds
Go packages CI | shelper tests, format, lint

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

Command | Description | Requires approval?
/metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request
/metaci integration tests | Same as above (alias) | Yes

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

Comment on lines +79 to +86
GRES_GPU_INDICES: str | None = field(
default=None,
metadata={
"parser": parse_gres_gpu_indices,
"field_name": "GRES_DETAIL",
"slurm_field": True,
},
)
Member

can we move to parsed_field here?

Author

parsed_field doesn't support a default parameter — it delegates to field(metadata=...) without passing default. Since GRES_GPU_INDICES must default to None (it's only populated via the REST API path, not squeue CLI), I used field(default=None, metadata=...) directly. This preserves all the same metadata (parser, field_name, slurm_field) that parsed_field would set.

Comment thread on gcm/monitoring/slurm/parsing.py (Outdated)
Comment on lines +322 to +323
if len(indices) >= 8 or not indices:
return None
Member

can we adapt this to work for nodes with > 8 GPUS?

Author

Done. Removed the hardcoded len(indices) >= 8 threshold from the parser — it now always returns the parsed indices regardless of count. The filtering decision is moved to the PHP caller (FAIRJobAnalyzerPerfAnalyzerModule.php), which compares the number of GPU indices against total GPUs on the node. Added test cases for 10-GPU and 16-GPU nodes to verify.

Member

@luccabb luccabb left a comment

since squeue CLI has no GRES_DETAIL this field would always be null, we'd need to add support for Slurm REST (SlurmRestClient) in the collection pipeline

Member

luccabb commented Apr 10, 2026

@claude did we miss anything?


claude bot commented Apr 10, 2026

Claude finished @luccabb's task in 1m 24s


Review: Did we miss anything?

  • Read changed files and gather context
  • Read CLAUDE.md for repo guidelines (none found)
  • Analyze for missed issues beyond existing review comments
  • Post comprehensive review

Beyond the items you already flagged, here are additional issues I found:


1. GRES_DETAIL is not a valid squeue -O format field — CLI collection will break

JOB_DATA_SLURM_FIELDS is used in client.py:179 to build the squeue --all -O format spec. Since GRES_DETAIL is not a recognized squeue output format field, this will cause squeue to error (or silently produce misaligned columns) for all CLI-based collection, not just this field.

This is the most critical issue. Since squeue CLI doesn't support GRES_DETAIL, adding it to JOB_DATA_SLURM_FIELDS (which happens automatically via the slurm_field: True metadata) will break the CLI path. The manual field() with "slurm_field": True on line 79-86 of squeue.py includes it in the CLI format spec.

Fix: Either set "slurm_field": False so it's excluded from JOB_DATA_SLURM_FIELDS, or (better) use parsed_field with slurm_field=False as luccabb suggested. The field would then only be populated via the REST API path in _map_job_fields. Fix this →

2. Hardcoded 8-GPU threshold doesn't generalize

As you noted in your review comment — parsing.py:322 hardcodes len(indices) >= 8 to detect full-node jobs. Nodes with 4 GPUs (e.g., A100x4) or >8 GPUs won't be handled correctly:

  • A 4-GPU full-node job would return "0,1,2,3" instead of None (unnecessary filtering)
  • A 10-GPU partial allocation (e.g., 10 of 16 GPUs) would return None (no filtering when it should)

The function has no way to know the node's total GPU count from gres_detail alone. Options:

  • Pass the node's GPU count as a parameter
  • Remove the threshold entirely and always return the parsed indices (let the caller decide)
  • Document this as a known limitation

3. _map_job_fields joins list values with commas — ambiguity with IDX commas

In rest_client.py:84, gres_detail (which is a JSON array) gets joined with ,:

if isinstance(value, list):
    value = ",".join(str(v) for v in value)

For a multi-node job, gres_detail might be ["gpu:ampere:3(IDX:0,3,5)", "gpu:ampere:3(IDX:1,4,7)"]. After joining: "gpu:ampere:3(IDX:0,3,5),gpu:ampere:3(IDX:1,4,7)".

The parse_gres_gpu_indices regex re.findall(r"IDX:([0-9,\-]+)", v) would match ["0,3,5", "1,4,7"] — correctly detecting multi-node (len > 1). However, if a single-node job has IDX values like IDX:0,3,5 and the comma-join places another entry adjacent without a clear delimiter, parsing could be ambiguous. Currently this works because the ) after IDX values separates them, but it's fragile. Consider using a different delimiter (e.g., |) for the list join, or better yet, handle the list in the parser directly.
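The behavior described can be checked directly with the regex quoted in this comment (a small demonstration, not code from the PR):

```python
import re

# Comma-joined gres_detail for a hypothetical two-node job.
joined = "gpu:ampere:3(IDX:0,3,5),gpu:ampere:3(IDX:1,4,7)"

# The closing paren after each IDX list is not in the character class,
# so it terminates each match and the per-node groups stay separate.
groups = re.findall(r"IDX:([0-9,\-]+)", joined)
print(groups)  # -> ['0,3,5', '1,4,7']
```

So multi-node detection (len(groups) > 1) survives the comma join today, but only because of the trailing ")"; a delimiter that cannot appear inside an IDX list (e.g., "|") would make the invariant explicit.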

4. Missing default=None in parsed_field approach

If you switch to parsed_field (per luccabb's comment), note that parsed_field doesn't currently support setting a default value — it delegates to field(metadata=...) without passing default. You'll need to either:

  • Add default support to parsed_field, or
  • Use Optional type annotation and ensure instantiate_dataclass handles missing keys by defaulting to None

Check how instantiate_dataclass handles missing keys for fields with slurm_field=False — if the key isn't in the row dict, it needs to gracefully default to None.

5. Minor: no @typechecked on the new test

The existing test functions in test_parsers.py use @typechecked decorator (e.g., test_parse_gres, test_parse_tres). The new test_parse_gres_gpu_indices at line 587 doesn't — minor inconsistency.


Summary

The most critical issue is #1: slurm_field: True will inject GRES_DETAIL into the squeue CLI format spec, which will likely break CLI-based job collection. The other items are correctness/robustness improvements worth addressing.

@meta-codesync meta-codesync bot changed the title Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts (#129) Apr 15, 2026
lushengt-meta added a commit to lushengt-meta/gcm that referenced this pull request Apr 15, 2026

meta-codesync bot commented Apr 15, 2026

@lushengt-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99787988.

lushengt-meta added a commit to lushengt-meta/gcm that referenced this pull request Apr 15, 2026
@lushengt-meta
Author

> since squeue CLI has no GRES_DETAIL this field would always be null, we'd need to add support for Slurm REST (SlurmRestClient) in the collection pipeline

Correct — GRES_DETAIL is a REST API-only field and is already mapped via "gres_detail": "GRES_DETAIL" in REST_TO_SQUEUE_FIELD_MAP. The _map_job_fields method in SlurmRestClient populates it from the REST API response. For CLI-based collection, slurm_field is set to False so GRES_DETAIL is excluded from JOB_DATA_SLURM_FIELDS and won't be injected into the squeue format spec. The field defaults to None when not populated.

@lushengt-meta
Author

> @claude did we miss anything?

Addressed the issues reported by Claude.

lushengt-meta added a commit to lushengt-meta/gcm that referenced this pull request Apr 15, 2026
@lushengt-meta lushengt-meta requested a review from luccabb April 15, 2026 01:48
lushengt-meta added a commit to lushengt-meta/gcm that referenced this pull request Apr 15, 2026
lushengt-meta added a commit to lushengt-meta/gcm that referenced this pull request Apr 15, 2026