Fix parser pad token handling and add loss mask tests #5

Open
guojasonliu wants to merge 1 commit into togethercomputer:main from guojasonliu:jason-fix-parser-pad-token

guojasonliu commented Apr 14, 2026

Summary

  • Preserve valid tokenizer pad token ids by only falling back to unk_token_id when pad_token_id is None.
  • Add parser tests for pad-token handling, assistant loss-mask spans, last_turn_only, and truncation.
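The fallback rule in the first bullet can be illustrated with a minimal sketch. This is not the code in `aurora/data/parse.py`: the function names `resolve_pad_id_truthy` and `resolve_pad_id` and the stand-in tokenizer objects are hypothetical, and the truthiness check shown is only one common way a pad id of 0 ends up treated as missing.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_pad_id_truthy(tokenizer) -> Optional[int]:
    """Hypothetical buggy pattern: `or` treats a valid pad id of 0 as missing."""
    return tokenizer.pad_token_id or tokenizer.unk_token_id


def resolve_pad_id(tokenizer) -> Optional[int]:
    """Hypothetical fixed pattern: fall back only when pad_token_id is None."""
    if tokenizer.pad_token_id is None:
        return tokenizer.unk_token_id
    return tokenizer.pad_token_id


# Stand-in tokenizers (assumed attribute names: pad_token_id, unk_token_id).
tok_zero_pad = SimpleNamespace(pad_token_id=0, unk_token_id=3)
tok_no_pad = SimpleNamespace(pad_token_id=None, unk_token_id=3)

# The truthiness check discards the valid pad id 0 and returns the unk id.
buggy_result = resolve_pad_id_truthy(tok_zero_pad)
# The explicit None check preserves pad id 0 and still falls back when unset.
fixed_result = resolve_pad_id(tok_zero_pad)
fallback_result = resolve_pad_id(tok_no_pad)
```

The explicit `is None` comparison keeps the fallback for tokenizers that define no pad token while leaving every integer id, including 0, untouched.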

Motivation

Aurora builds draft-model training targets from parsed conversation traces, and the assistant loss_mask determines which tokens contribute to the training signal. Treating token id 0 as missing replaces the pad token with unk_token_id for tokenizers that validly use 0 as padding, which changes parser behavior.
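To make the role of the assistant loss mask concrete, here is a rough sketch of how such a mask might be built over already-tokenized turns. The function `build_loss_mask`, its parameters, and the `(role, token_ids)` turn format are assumptions for illustration only; the actual parser in this PR may differ, in particular in how it truncates.

```python
from typing import List, Optional, Tuple

Turn = Tuple[str, List[int]]  # (role, token ids); an assumed format


def build_loss_mask(
    turns: List[Turn],
    last_turn_only: bool = False,
    max_length: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, loss_mask) with 1 on trainable assistant tokens."""
    assistant_positions = [
        i for i, (role, _) in enumerate(turns) if role == "assistant"
    ]
    last_assistant = assistant_positions[-1] if assistant_positions else None

    input_ids: List[int] = []
    loss_mask: List[int] = []
    for i, (role, ids) in enumerate(turns):
        # Only assistant tokens train; optionally only the final assistant turn.
        trainable = role == "assistant" and (
            not last_turn_only or i == last_assistant
        )
        input_ids.extend(ids)
        loss_mask.extend([1 if trainable else 0] * len(ids))

    # Assumed truncation policy: keep the first max_length tokens.
    if max_length is not None:
        input_ids = input_ids[:max_length]
        loss_mask = loss_mask[:max_length]
    return input_ids, loss_mask


example_turns = [
    ("user", [5, 6]),
    ("assistant", [7, 8, 9]),
    ("user", [10]),
    ("assistant", [11, 12]),
]
full_ids, full_mask = build_loss_mask(example_turns)
_, last_mask = build_loss_mask(example_turns, last_turn_only=True)
trunc_ids, trunc_mask = build_loss_mask(example_turns, max_length=4)
```

In this sketch the mask and the token ids are always truncated together, so their lengths stay aligned.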

Testing

  • python3 -m pytest tests/test_parse.py -q
  • python3 -m ruff check aurora/data/parse.py tests/test_parse.py
  • python3 -m ruff format --check aurora/data/parse.py tests/test_parse.py
