
Feat/telemetry export clean #89

Closed
theap06 wants to merge 18 commits into facebookresearch:main from theap06:feat/telemetry-export-clean

Conversation

@theap06
Contributor

@theap06 theap06 commented Mar 4, 2026

Closes #87

Summary

Adds a new telemetry exporter that periodically appends structured telemetry snapshots to a local file in JSON or CSV format for offline analysis.

What's New

  • New sink: --sink=telemetry
  • Output formats: JSON (NDJSON) or CSV
  • Options:
    • file_path (required): Path to the output file
    • format (optional): json (default) or csv

Usage

# NDJSON
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once

# CSV
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

# CSV append: run the same command a second time; rows are appended below the existing header
gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-b.csv --sink-opt format=csv --once
cat /tmp/telemetry-b.csv

CSV Output

Output 1:
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...

Output 2: 
hostname,gpu_id,gpu_util,mem_used_percent
...,0,...,...
...,1,...,...
...,0,...,...
...,1,...,...

Example Output (JSON)

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-a.ndjson --once
cat /tmp/telemetry-a.ndjson

gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry-c.ndjson --once
cat /tmp/telemetry-c.ndjson

Output 1:
{"hostname":"...","gpu_id":0,"gpu_util":...}
{"hostname":"...","gpu_id":1,"gpu_util":...}

Output 2: 
{"hostname":"...","gpu_id":0,"gpu_util":...}

Implementation

  • ~60 lines of code in gcm/exporters/telemetry.py
  • Follows existing exporter conventions (@register, write(Log, SinkAdditionalParams))
  • Works with gcm nvml_monitor, gcm slurm_monitor, and health check commands
  • Auto-creates parent directories if needed

Testing

pytest gcm/tests/test_telemetry_exporter.py -v

============================= test session starts ==============================
collected 10 items
...
============================== 10 passed in <time> =============================


$ gcm nvml_monitor --sink=telemetry --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ cat /tmp/telemetry.csv

Expected structure when schema changes mid-stream:
normal.job_id,normal.state,normal.user
1,RUNNING,alice
normal.gpu_uuid,normal.memory_used_mb
GPU-123,2048

@github-actions

github-actions bot commented Mar 4, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06
Contributor Author

theap06 commented Mar 8, 2026

@luccabb could I get some feedback?

Member

@luccabb luccabb left a comment


@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

theap06 added 4 commits March 7, 2026 23:27
…research#87)

Add a new 'telemetry' sink that periodically appends telemetry snapshots
to a local file in JSON or CSV format for offline analysis.

- JSON: NDJSON format (one object per line)
- CSV: Header on first write, append rows
- Options: file_path (required), format (json|csv, default json)
- Works with nvml_monitor, slurm_monitor, and health checks

Closes facebookresearch#87

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 94e1f2c to 4f00cf7 Compare March 8, 2026 07:28
@theap06
Contributor Author

theap06 commented Mar 8, 2026

@theap06 we already support json to files at the file exporter, maybe add CSV shape support there?

https://facebookresearch.github.io/gcm/docs/GCM_Monitoring/exporters/file/

added support for csv

@theap06 theap06 requested a review from luccabb March 8, 2026 07:29
Comment thread gcm/exporters/telemetry.py Outdated
Comment thread gcm/exporters/file.py Outdated
Comment thread gcm/exporters/file.py
theap06 and others added 2 commits March 8, 2026 16:59
Co-authored-by: lucca bertoncini <32229669+luccabb@users.noreply.github.com>
@theap06 theap06 requested a review from luccabb March 9, 2026 00:03
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 184a18e to 060fafe Compare March 9, 2026 09:03
@theap06 theap06 requested a review from A-Kokolis as a code owner March 9, 2026 09:03
- Add format option (json|csv) to file exporter
- CSV output uses flattened scuba message structure
- First write adds header row, subsequent writes append
- Update docs for both Monitoring and Health Checks

Made-with: Cursor
@theap06 theap06 force-pushed the feat/telemetry-export-clean branch from 060fafe to 961b5b7 Compare March 9, 2026 09:43
Comment thread gcm/exporters/file.py
Comment thread gcm/exporters/file.py Outdated
@theap06 theap06 requested a review from luccabb March 10, 2026 09:02
@theap06
Contributor Author

theap06 commented Mar 10, 2026

@luccabb Could I get some feedback?

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 47s


Re-review: Schema-Change Fix Not Implemented

  • Read CLAUDE.md
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify schema-change fix
  • Post review feedback

@theap06 Unfortunately the critical issue has not been fixed. The code still writes a new header mid-file when the schema changes, producing an invalid CSV.

What the code still does (broken)

gcm/exporters/file.py lines 119–131:

if previous_fieldnames != fieldnames:
    header_buf = io.StringIO()
    header_writer = csv.DictWriter(header_buf, fieldnames=all_keys, ...)
    header_writer.writeheader()
    logger.info(header_buf.getvalue())   # ← writes new header to the SAME file
    self._csv_fieldnames[path] = fieldnames

When the schema changes, a new header row is written into the same file via the same logger. This produces the same invalid output as before:

job_id,state,user
1,RUNNING,alice
gpu_uuid,memory_used_mb    ← second header inserted mid-file
GPU-123,2048

Test still validates the broken behavior

gcm/tests/test_file_exporter.py lines 62–67:

lines = path.read_text().splitlines()
assert lines == [
    "job_id,state,user",
    "1,RUNNING,alice",
    "gpu_uuid,memory_used_mb",   # ← still asserts mixed-header output as "correct"
    "GPU-123,2048",
]

The test explicitly asserts this broken output as expected behavior — nothing has changed here.

What needs to happen

When a schema change is detected, a new file should be created instead of inserting a header into the existing file. For example:

  • First schema → telemetry.csv
  • Second schema → telemetry_1.csv
  • Third schema → telemetry_2.csv

The test should then verify that two separate, valid CSV files are created — each parseable independently.


@theap06
Contributor Author

theap06 commented Mar 12, 2026

@claude made the changes. Can you review my pr?

@luccabb
Member

luccabb commented Mar 12, 2026

@claude

@claude

claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 54s


Re-review: Schema-Change Fix Verified ✓

  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Verify the schema-change fix
  • Post review feedback

The critical issue has been fixed correctly this time. Here's what was implemented:

What Changed

New helper (file.py:29-33):

def _schema_versioned_path(path: str, schema_index: int) -> str:
    if schema_index == 0:
        return path
    stem, ext = os.path.splitext(path)
    return f"{stem}_{schema_index}{ext}"

New tracking dicts (file.py:64-67):

  • _data_identifier_to_base_path — remembers the original path for generating versioned names
  • _csv_schema_index — tracks the current rollover counter per DataIdentifier

Schema-change rollover logic (file.py:139-158): When the schema changes, a new versioned file is created (e.g., data.csv → data_1.csv), a new logger is initialized pointing to it, and the path/logger maps are updated. The subsequent if previous_fieldnames != fieldnames check then correctly writes the header to the new file.

Test updated (test_file_exporter_csv_rolls_over_on_schema_change, lines 45-72): Verifies two separate, valid CSV files are produced — data.csv with the first schema and data_1.csv with the second schema. Each file is independently parseable.


Remaining Minor Nit (non-blocking)

DictWriter created per-record inside the loop (file.py:173-182):

for record in records:
    row_buf = io.StringIO()
    row_writer = csv.DictWriter(row_buf, fieldnames=all_keys, ...)
    row_writer.writerow(record)
    logger.info(row_buf.getvalue())

A new StringIO and DictWriter is allocated for every single record. The DictWriter could be created once outside the loop and only the StringIO reset per record. Minor performance issue, low priority.
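A sketch of the suggested restructuring (an illustrative helper, not the actual patch — names follow the snippet above): build the DictWriter once and reset the shared buffer per record:

```python
import csv
import io


def rows_to_lines(records, all_keys):
    """Serialize each record as one CSV line, reusing a single DictWriter."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=all_keys)
    lines = []
    for record in records:
        buf.seek(0)
        buf.truncate(0)  # reset the buffer instead of allocating a new one
        writer.writerow(record)
        lines.append(buf.getvalue().rstrip("\r\n"))
    return lines
```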


Verdict

The critical CSV schema-change bug is fixed. The implementation is clean, follows existing logger/path patterns, and the test correctly validates the rollover behavior. This is ready to merge pending maintainer approval.

@theap06
Contributor Author

theap06 commented Mar 12, 2026

@claude fixed the perf issues.

@theap06
Contributor Author

theap06 commented Mar 12, 2026

@luccabb should be good to go now. maybe we can add a ci workflow like /rubric or something for quick feedback on prs. tool calls instead of always needing to call claude api.

Member

@luccabb luccabb left a comment


@theap06 keep the test plan updated as you make changes

@theap06 theap06 requested a review from luccabb March 12, 2026 20:02
@theap06
Contributor Author

theap06 commented Mar 12, 2026

@luccabb done

Member

@luccabb luccabb left a comment


test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

@theap06 theap06 requested a review from luccabb March 13, 2026 06:07
@theap06
Contributor Author

theap06 commented Mar 13, 2026

test plan is wrong. add the execution cmd (cat filename) and output from that cmd for the following lines.

you should likely be passing more than 1 filename, given that you'd want data to be written to different files. please test this thoroughly and ensure test plans are fully complete before sending back for review

Sorry about that! I just updated the test plan.

@theap06
Contributor Author

theap06 commented Mar 14, 2026

@claude review my pr

@luccabb
Member

luccabb commented Mar 14, 2026

@claude

@claude

claude bot commented Mar 14, 2026

Claude finished @luccabb's task in 1m 18s


PR Review: CSV Support for File Exporter

  • Read CLAUDE.md and repo context
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Read documentation changes
  • Analyze code quality, correctness, and completeness
  • Post review feedback

Summary

All previously identified issues have been addressed. The implementation is clean and correct.

What's Been Fixed

  1. Schema-change rollover (file.py:139-158): When CSV schema changes, a new versioned file is created (data.csv → data_1.csv). Each file is a valid, independently parseable CSV. ✓

  2. DictWriter reuse (file.py:173-184): DictWriter and StringIO are now created once outside the record loop; only seek(0) + truncate(0) are called per record. ✓

  3. Tests (test_file_exporter.py:45-72): test_file_exporter_csv_rolls_over_on_schema_change correctly verifies two separate valid CSV files are produced. ✓

  4. Documentation (GCM_Monitoring/exporters/file.md): The CSV section clearly explains the format option and schema-rollover behavior. ✓


Minor Nits (non-blocking)

1. Conditional initialization of _csv_fieldnames and _csv_schema_index (file.py:65-67):

if self.format == "csv":
    self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}
    self._csv_schema_index: Dict[DataIdentifier, int] = {}

These attributes only exist when format == "csv". Since the write() method gates all CSV access behind if self.format == "csv", this works correctly in practice — but it means mypy/type checkers may not see these as always-defined. Initializing them unconditionally as {} would be slightly cleaner, but it's not a bug.

2. GCM Health Checks doc not updated for CSV (website/docs/GCM_Health_Checks/exporters/file.md):

The health checks doc now has the format option in the table (line 17) but doesn't include the CSV section that was added to the monitoring doc. Minor inconsistency — the monitoring doc explains the rollover behavior but the health checks doc doesn't.

3. No test for multiple successive schema changes:

Only one rollover is tested (data.csv → data_1.csv). A test covering data_1.csv → data_2.csv would give higher confidence in the index-increment logic (file.py:144), but the implementation looks correct by inspection.


Verdict

The critical bug is fixed, performance nit is addressed, tests are correct, and documentation is updated. This is ready to merge. The nits above are minor and non-blocking.

@theap06
Contributor Author

theap06 commented Mar 14, 2026

@luccabb should be ready to merge

Member

@luccabb luccabb left a comment


if I run this PR twice it breaks CSV structure, headers shouldn't be repeated in the middle of the file

$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ gcm nvml_monitor --sink=file --sink-opt file_path=/tmp/telemetry.csv --sink-opt format=csv --once
$ cat /tmp/telemetry.csv 
gpu_id,gpu_util,hostname,job_id,mem_used_percent,mem_util,power_draw,power_used_percent,retired_pages_count_double_bit,retired_pages_count_single_bit,temperature
0,0,host0101,-1,0,0,33111,14,0,0,38
1,0,host0101,-1,0,0,30435,12,0,0,33
gpu_id,gpu_util,hostname,job_id,mem_used_percent,mem_util,power_draw,power_used_percent,retired_pages_count_double_bit,retired_pages_count_single_bit,temperature
0,0,host0101,-1,0,0,33111,14,0,0,37
1,0,host0101,-1,0,0,30206,12,0,0,33

@luccabb
Member

luccabb commented Mar 17, 2026

@claude

@claude

claude bot commented Mar 17, 2026

Claude finished @luccabb's task in 1m 32s


Re-review: New Bug — Header Repeated on Second Run

  • Read CLAUDE.md
  • Read gcm/exporters/file.py
  • Read gcm/tests/test_file_exporter.py
  • Analyze root cause
  • Post review feedback

Bug Confirmed: Duplicate Header on Second Process Invocation

luccabb's diagnosis is exactly right. Here is the precise chain of events.

Root cause — gcm/exporters/file.py

__init__ (line 66):

self._csv_fieldnames: Dict[str, Tuple[str, ...]] = {}

_csv_fieldnames is always initialised as an empty dict on every new process start. It holds the only memory of "what fieldnames were already written to this file path."

write (lines 137 and 161–171):

previous_fieldnames = self._csv_fieldnames.get(path)   # line 137 — always None on 2nd run

...

if previous_fieldnames != fieldnames:                   # line 161 — None != anything → True
    header_writer.writeheader()                         # line 169 — header written again
    logger.info(header_buf.getvalue())                  # line 170
    self._csv_fieldnames[path] = fieldnames             # line 171

On the first run, previous_fieldnames starts as None, so the header is written and _csv_fieldnames[path] is set. On every subsequent write() call within the same process, previous_fieldnames == fieldnames and no header is re-emitted — correct.

On the second run (a fresh process invocation against the same file), _csv_fieldnames is empty again. self._csv_fieldnames.get(path) returns None, the condition None != fieldnames is True, and the header is written a second time mid-file.

What the tests miss

test_file_exporter_csv (line 25) and test_file_exporter_csv_rolls_over_on_schema_change (line 45) both create a fresh File object and a fresh file in tmp_path — they never simulate a second process opening an existing file. There is no test for the "append to existing file" scenario.

Fix needed

When __init__ registers a path and that path already exists on disk with at least one line, read the first line and use it to pre-populate _csv_fieldnames[path]. Something along the lines of:

# inside __init__, after each path is resolved and before init_logger
if self.format == "csv" and os.path.isfile(resolved_path):
    with open(resolved_path, newline="") as fh:
        reader = csv.reader(fh)
        try:
            existing_header = tuple(next(reader))
        except StopIteration:
            existing_header = None
    if existing_header:
        self._csv_fieldnames[resolved_path] = existing_header

This seeds the in-memory dict before the first write() call so that previous_fieldnames != fieldnames is False when the schema is unchanged, suppressing the duplicate header.

A matching test should create a File, write rows, then create a second File pointing to the same path and write more rows, asserting that csv.DictReader sees the header exactly once.

Summary

| Item | Detail |
| --- | --- |
| File | gcm/exporters/file.py |
| Lines | __init__ line 66 (dict init); write lines 137 and 161–171 (header guard) |
| Root cause | _csv_fieldnames is in-memory only; a second invocation sees an empty dict and re-emits the header |
| Fix | Seed _csv_fieldnames from the first line of any pre-existing file during __init__ |
| Missing test | No test covers the "second process, same file" scenario |


@theap06 theap06 requested a review from luccabb March 18, 2026 06:21
@theap06
Contributor Author

theap06 commented Mar 18, 2026

@claude review my pr

@theap06
Contributor Author

theap06 commented Mar 25, 2026

@luccabb could I get feedback on this pr

@theap06
Contributor Author

theap06 commented Apr 1, 2026

@luccabb it should be good to merge

@theap06
Contributor Author

theap06 commented Apr 5, 2026

@luccabb I'll close this pr for now.

@theap06 theap06 closed this Apr 5, 2026


Development

Successfully merging this pull request may close these issues.

🚀[Feature Request]: Structured Telemetry Export (CSV/JSON) for Offline Analysis

2 participants