Fix parser pad token handling and add loss mask tests by guojasonliu · Pull Request #5 · togethercomputer/aurora

guojasonliu · 2026-04-14T23:25:16Z

Summary

Preserve valid tokenizer pad token ids by only falling back to unk_token_id when pad_token_id is None.
Add parser tests for pad-token handling, assistant loss-mask spans, last_turn_only, and truncation.
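The pad-token fallback described above can be sketched as follows. This is a minimal illustration, not the code in `aurora/data/parse.py`: the helper names `resolve_pad_token_id` and `resolve_pad_token_id_truthy` are hypothetical, the truthiness-based variant is an assumption about what the earlier behavior looked like, and the tokenizers are stand-in objects exposing only `pad_token_id` and `unk_token_id`.

```python
from types import SimpleNamespace


def resolve_pad_token_id(tokenizer):
    """Hypothetical helper: fall back to unk_token_id only when pad_token_id is None."""
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        return tokenizer.unk_token_id
    return pad_id


def resolve_pad_token_id_truthy(tokenizer):
    """Assumed problematic variant: `or` treats a valid pad id of 0 as missing."""
    return tokenizer.pad_token_id or tokenizer.unk_token_id


# Stand-in tokenizers (not real tokenizer classes) with only the two attributes used here.
tok_pad_zero = SimpleNamespace(pad_token_id=0, unk_token_id=3)
tok_pad_none = SimpleNamespace(pad_token_id=None, unk_token_id=3)

print(resolve_pad_token_id(tok_pad_zero))         # pad id 0 is preserved
print(resolve_pad_token_id_truthy(tok_pad_zero))  # pad id 0 is replaced by unk id
print(resolve_pad_token_id(tok_pad_none))         # falls back to unk id
```

The explicit `is None` comparison is what distinguishes "no pad token configured" from "pad token configured with id 0".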

Motivation

Aurora builds draft-model training targets from parsed conversation traces. The assistant loss_mask determines which tokens contribute to the training signal. Treating token id 0 as missing replaces the pad id with unk_token_id for tokenizers that validly use 0 as padding, which changes parser behavior for those tokenizers.
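To make the role of the loss mask concrete, here is a small sketch of how assistant spans, last_turn_only, and truncation could interact. The function `build_loss_mask`, its parameters, and the representation of spans as half-open `(start, end)` token ranges are all assumptions made for illustration; the PR does not show how Aurora's parser represents them.

```python
def build_loss_mask(seq_len, assistant_spans, last_turn_only=False, max_length=None):
    """Hypothetical sketch: 1 marks tokens that contribute to the loss, 0 marks the rest.

    assistant_spans is assumed to be a list of half-open (start, end) token ranges.
    """
    spans = sorted(assistant_spans)
    if last_turn_only and spans:
        # Keep only the final assistant turn.
        spans = spans[-1:]

    mask = [0] * seq_len
    for start, end in spans:
        # Clamp each span to the sequence bounds before marking it.
        for i in range(max(start, 0), min(end, seq_len)):
            mask[i] = 1

    if max_length is not None:
        # Truncate the mask the same way the token ids would be truncated.
        mask = mask[:max_length]
    return mask


spans = [(2, 4), (6, 8)]
print(build_loss_mask(8, spans))
print(build_loss_mask(8, spans, last_turn_only=True))
print(build_loss_mask(8, spans, max_length=5))
```

Tests along the lines listed in the Summary would assert on exactly these three cases: all assistant spans masked in, only the last turn masked in, and spans cut off at the truncation boundary.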

Testing

python3 -m pytest tests/test_parse.py -q
python3 -m ruff check aurora/data/parse.py tests/test_parse.py
python3 -m ruff format --check aurora/data/parse.py tests/test_parse.py

Fix parser pad token handling and add loss mask tests

2221b69

guojasonliu wants to merge 1 commit into togethercomputer:main from guojasonliu:jason-fix-parser-pad-token
