
Feat/telemetry export clean #89

Closed
theap06 wants to merge 18 commits into facebookresearch:main from theap06:feat/telemetry-export-clean

Conversation

@theap06
Contributor

@theap06 theap06 commented Mar 4, 2026

Closes #87

Summary

Adds a new telemetry exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New

  • New sink: --sink=telemetry
  • Output formats: JSON (NDJSON) or CSV
  • Options:
    • file_path (required): Path to the output file
    • format (optional): json (default) or csv

Usage

# NDJSON
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once

# CSV
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

# CSV append: run the same command a second time; rows are appended below the existing header
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

CSV Output

Output 1:
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...

Output 2: 
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...
...,0,...,...
...,1,...,...

Example Output (JSON)

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once
cat /tmp/telemetry-a.ndjson

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-c.ndjson --once
cat /tmp/telemetry-c.ndjson

Output 1:
{"hostname":"...","gpu_id":0,"gpu_util":...}
{"hostname":"...","gpu_id":1,"gpu_util":...}

Output 2: 
{"hostname":"...","gpu_id":0,"gpu_util":...}

Implementation

  • ~60 lines of code in gcm/exporters/telemetry.py
  • Follows existing exporter conventions (@register, write(Log, SinkAdditionalParams))
  • Works with gcm nvml_monitor, gcm slurm_monitor, and health check commands
  • Auto-creates parent directories if needed

Testing

pytest gcm/tests/test_telemetry_exporter.py -v

============================= test session starts ==============================
collected 10 items
...
============================== 10 passed in <time> =============================


$ gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ cat /tmp/telemetry.csv

Expected structure when schema changes mid-stream:
normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048

@github-actions

github-actions bot commented Mar 4, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06
Contributor Author

theap06 commented Mar 8, 2026

@luccabb could I get some feedback?

Member

@luccabb luccabb left a comment


@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

theap06 added 4 commits March 7, 2026 23:27
…research#87)

Add a new 'telemetry' sink that periodically appends telemetry snapshots
to a local file in JSON or CSV format for offline analysis.

- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 94e1f2c to 4f00cf7 Compare March 8, 2026 07:28
@theap06
Contributor Author

theap06 commented Mar 8, 2026

@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

added support for csv

@theap06 theap06 requested a review from luccabb March 8, 2026 07:29
Comment thread gcm/exporters/telemetry.py Outdated
Comment thread gcm/exporters/file.py Outdated
Comment thread gcm/exporters/file.py
theap06 and others added 2 commits March 8, 2026 16:59
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
@theap06 theap06 requested a review from luccabb March 9, 2026 00:03
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 184a18e to 060fafe Compare March 9, 2026 09:03
@theap06 theap06 requested a review from A-Kokolis as a code owner March 9, 2026 09:03
- Add format option (json|csv) to file exporter
- CSV output uses flattened scuba message structure
- First write adds header row, subsequent writes append
- Update docs for both Monitoring and Health Checks

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 060fafe to 961b5b7 Compare March 9, 2026 09:43
Comment thread gcm/exporters/file.py
Comment thread gcm/exporters/file.py Outdated
@theap06 theap06 requested a review from luccabb March 10, 2026 09:02
@theap06
Contributor Author

theap06 commented Mar 10, 2026

@luccabb Could I get some feedback?

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 47s


Re-review: Schema-Change Fix Not Implemented

  • Read CLAUDE.md
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify schema-change fix
  • Post review feedback

@theap06 Unfortunately the critical issue has not been fixed. The code still writes a new header mid-file when the schema changes, producing an invalid CSV.

What the code still does (broken)

gcm/exporters/file.py lines 119–131:

if previous_fieldnames != fieldnames:
    header_buf = io.StringIO()
    header_writer = csv.DictWriter(header_buf, fieldnames=all_keys, ...)
    header_writer.writeheader()
    logger.info(header_buf.getvalue())   # ← writes new header to the SAME file
    self._csv_fieldnames[path] = fieldnames

When the schema changes, a new header row is written into the same file via the same logger. This produces the same invalid output as before:

job_id,state,user
1,RUNNING,alice
gpu_uuid,memory_used_mb    ← second header inserted mid-file
GPU-123,2048

Test still validates the broken behavior

gcm/tests/test_file_exporter.py lines 62–67:

lines = path.read_text().splitlines()
assert lines == [
    "job_id,state,user",
    "1,RUNNING,alice",
    "gpu_uuid,memory_used_mb",   # ← still asserts mixed-header output as "correct"
    "GPU-123,2048",
]

The test explicitly asserts this broken output as expected behavior — nothing has changed here.

What needs to happen

When a schema change is detected, a new file should be created instead of inserting a header into the existing file. For example:

  • First schema → telemetry.csv
  • Second schema → telemetry_1.csv
  • Third schema → telemetry_2.csv

The test should then verify that two separate, valid CSV files are created — each parseable independently.


@theap06
Contributor Author

theap06 commented Mar 12, 2026

@claude made the changes. Can you review my pr?

@luccabb
Member

luccabb commented Mar 12, 2026

@claude

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 54s


Re-review: Schema-Change Fix Verified ✓

  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify the schema-change fix
  • Post review feedback

The critical issue has been fixed correctly this time. Here's what was implemented:

What Changed

New helper (file.py:29-33):

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

New tracking dicts (file.py:64-67):

  • _data_identifier_to_base_path — remembers the original path for generating versioned names
  • _csv_schema_index — tracks the current rollover counter per DataIdentifier

Schema-change rollover logic (file.py:139-158): When the schema changes, a new versioned file is created (e.g., data.csv → data_1.csv), a new logger is initialized pointing to it, and the path/logger maps are updated. The subsequent if previous_fieldnames != fieldnames check then correctly writes the header to the new file.

Test updated (test_file_exporter_csv_rolls_over_on_schema_change, lines 45-72): Verifies two separate, valid CSV files are produced — data.csv with the first schema and data_1.csv with the second schema. Each file is independently parseable.


Remaining Minor Nit (non-blocking)

DictWriter created per-record inside the loop (file.py:173-182):

for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new StringIO and DictWriter is allocated for every single record. The DictWriter could be created once outside the loop and only the StringIO reset per record. Minor performance issue, low priority.
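A sketch of the suggested restructuring (an illustrative helper, not the actual patch — names follow the snippet above): build the DictWriter once and reset the shared buffer per record:

```python
import csv
import io


def rows_to_lines(records, all_keys):
    """Serialize each record as one CSV line, reusing a single DictWriter."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=all_keys)
    lines = []
    for record in records:
        buf.seek(0)
        buf.truncate(0)  # reset the buffer instead of allocating a new one
        writer.writerow(record)
        lines.append(buf.getvalue().rstrip("\r\n"))
    return lines
```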


Verdict

The critical CSV schema-change bug is fixed. The implementation is clean, follows existing logger/path patterns, and the test correctly validates the rollover behavior. This is ready to merge pending maintainer approval.

@theap06
Contributor Author

theap06 commented Mar 12, 2026

@claude fixed the perf issues.

@theap06
Contributor Author

theap06 commented Mar 12, 2026

@luccabb should be good to go now. maybe we can add a ci workflow like /rubric or something for quick feedback on prs. tool calls instead of always needing to call claude api.

Member

@luccabb luccabb left a comment


@theap06 keep the test plan updated as you make changes

@theap06 theap06 requested a review from luccabb March 12, 2026 20:02
@theap06
Contributor Author

theap06 commented Mar 12, 2026

@luccabb done

Member

@luccabb luccabb left a comment


test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

@theap06 theap06 requested a review from luccabb March 13, 2026 06:07
@theap06
Contributor Author

theap06 commented Mar 13, 2026

test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

Sorry about that! I just updated the test plan.

@theap06
Contributor Author

theap06 commented Mar 14, 2026

@claude review my pr

@luccabb
Member

luccabb commented Mar 14, 2026

@claude

@claude

claude bot commented Mar 14, 2026

Claude finished @luccabb's task in 1m 18s


PR Review: CSV Support for File Exporter

  • Read CLAUDE.md and repo context
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Read documentation changes
  • Analyze code quality, correctness, and completeness
  • Post review feedback

Summary

All previously identified issues have been addressed. The implementation is clean and correct.

What's Been Fixed

  1. Schema-change rollover (file.py:139-158): When CSV schema changes, a new versioned file is created (data.csv → data_1.csv). Each file is a valid, independently parseable CSV. ✓

  2. DictWriter reuse (file.py:173-184): DictWriter and StringIO are now created once outside the record loop; only seek(0) + truncate(0) are called per record. ✓

  3. Tests (test_file_exporter.py:45-72): test_file_exporter_csv_rolls_over_on_schema_change correctly verifies two separate valid CSV files are produced. ✓

  4. Documentation (GCM_Monitoring/exporters/file.md): The CSV section clearly explains the format option and schema-rollover behavior. ✓


Minor Nits (non-blocking)

1. Conditional initialization of _csv_fieldnames and _csv_schema_index (file.py:65-67):

if self.format == "csv":
    self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}
    self._csv_schema_index: Dict[DataIdentifier, int] = {}

These attributes only exist when format == "csv". Since the write() method gates all CSV access behind if self.format == "csv", this works correctly in practice — but it means mypy/type checkers may not see these as always-defined. Initializing them unconditionally as {} would be slightly cleaner, but it's not a bug.

2. GCM Health Checks doc not updated for CSV (website/docs/GCM_Health_Checks/exporters/file.md):

The health checks doc now has the format option in the table (line 17) but doesn't include the CSV section that was added to the monitoring doc. Minor inconsistency — the monitoring doc explains the rollover behavior but the health checks doc doesn't.

3. No test for multiple successive schema changes:

Only one rollover is tested (data.csv → data_1.csv). A test covering data_1.csv → data_2.csv would give higher confidence in the index-increment logic (file.py:144), but the implementation looks correct by inspection.


Verdict

The critical bug is fixed, performance nit is addressed, tests are correct, and documentation is updated. This is ready to merge. The nits above are minor and non-blocking.

@theap06
Contributor Author

theap06 commented Mar 14, 2026

@luccabb should be ready to merge

Member

@luccabb luccabb left a comment


if I run this PR twice it breaks CSV structure, headers shouldn't be repeated in the middle of the file

$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ cat /tmp/telemetry.csv 
gpu_id,gpu_util,hostname,job_id,mem_used_percent,mem_util,power_draw,power_used_percent,retired_pages_count_double_bit,retired_pages_count_single_bit,temperature
0,0,host0101,-1,0,0,33111,14,0,0,38
1,0,host0101,-1,0,0,30435,12,0,0,33
gpu_id,gpu_util,hostname,job_id,mem_used_percent,mem_util,power_draw,power_used_percent,retired_pages_count_double_bit,retired_pages_count_single_bit,temperature
0,0,host0101,-1,0,0,33111,14,0,0,37
1,0,host0101,-1,0,0,30206,12,0,0,33

@luccabb
Member

luccabb commented Mar 17, 2026

@claude

@claude

claude bot commented Mar 17, 2026

Claude finished @luccabb's task in 1m 32s


Re-review: New Bug — Header Repeated on Second Run

  • Read CLAUDE.md
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Analyze root cause
  • Post review feedback

Bug Confirmed: Duplicate Header on Second Process Invocation

luccabb's diagnosis is exactly right. Here is the precise chain of events.

Root cause — gcm/exporters/file.py

__init__ (line 66):

self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}

_csv_fieldnames is always initialised as an empty dict on every new process start. It holds the only memory of "what fieldnames were already written to this file path."

write (lines 137 and 161–171):

previous_fieldnames = self._csv_fieldnames.get(path)   # line 137 — always None on 2nd run

...

if previous_fieldnames != fieldnames:                   # line 161 — None != anything → True
    header_writer.writeheader()                         # line 169 — header written again
    logger.info(header_buf.getvalue())                  # line 170
    self._csv_fieldnames[path] = fieldnames             # line 171

On the first run, previous_fieldnames starts as None, so the header is written and _csv_fieldnames[path] is set. On every subsequent write() call within the same process, previous_fieldnames == fieldnames and no header is re-emitted — correct.

On the second run (a fresh process invocation against the same file), _csv_fieldnames is empty again. self._csv_fieldnames.get(path) returns None, the condition None != fieldnames is True, and the header is written a second time mid-file.

What the tests miss

test_file_exporter_csv (line 25) and test_file_exporter_csv_rolls_over_on_schema_change (line 45) both create a fresh File object and a fresh file in tmp_path — they never simulate a second process opening an existing file. There is no test for the "append to existing file" scenario.

Fix needed

When __init__ registers a path and that path already exists on disk with at least one line, read the first line and use it to pre-populate _csv_fieldnames[path]. Something along the lines of:

# inside __init__, after each path is resolved and before init_logger
if self.format == "csv" and os.path.isfile(resolved_path):
    with open(resolved_path, newline="") as fh:
        reader = csv.reader(fh)
        try:
            existing_header = tuple(next(reader))
        except StopIteration:
            existing_header = None
    if existing_header:
        self._csv_fieldnames[resolved_path] = existing_header

This seeds the in-memory dict before the first write() call so that previous_fieldnames != fieldnames is False when the schema is unchanged, suppressing the duplicate header.

A matching test should create a File, write rows, then create a second File pointing to the same path and write more rows, asserting that csv.DictReader sees the header exactly once.

Summary

| Item | Detail |
| --- | --- |
| File | gcm/exporters/file.py |
| Lines | __init__ line 66 (dict init); write lines 137 and 161–171 (header guard) |
| Root cause | _csv_fieldnames is in-memory only; a second invocation sees an empty dict and re-emits the header |
| Fix | Seed _csv_fieldnames from the first line of any pre-existing file during __init__ |
| Missing test | No test covers the "second process, same file" scenario |


@theap06 theap06 requested a review from luccabb March 18, 2026 06:21
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@claude review my pr

@theap06
Contributor Author

theap06 commented Mar 25, 2026

@luccabb could I get feedback on this pr

@theap06
Contributor Author

theap06 commented Apr 1, 2026

@luccabb it should be good to merge

@theap06
Contributor Author

theap06 commented Apr 5, 2026

@luccabb I'll close this pr for now.

@theap06 theap06 closed this Apr 5, 2026


Development

Successfully merging this pull request may close these issues.

🚀[Feature Request]: Structured Telemetry Export (CSV/JSON) for Offline Analysis

2 participants