
Add multi-book sliding evaluation #1121

Open

ahmeda14960 wants to merge 53 commits into memorize from codex/explain-careless-eval-lm-and-fsdp-handling

Conversation

@ahmeda14960
Contributor

Summary

  • Add `eval_sliding_total.py`, a driver that runs the careless suffix-likelihood evaluation over many books from a single config
  • Derive default output plot/data filenames from `book_title` when they are not specified
  • Include a YAML config covering all bundled books, with automatic per-book subdirectories on GCS

Testing

  • `pytest tests/test_background_iterable.py -k test_reentrancy -q` *(fails: ImportError: cannot import name 'PositionalSharding' from 'jax.sharding')*
  • `pip install --quiet "jax==0.4.26" "jaxlib==0.4.26" -f https://storage.googleapis.com/jax-releases/jax_releases.html` *(fails: Could not find a version that satisfies the requirement jax==0.4.26)*
  • `pip install --quiet pre-commit && pre-commit run --files src/levanter/main/eval_careless_lm.py src/levanter/main/eval_sliding_total.py config/books/eval_sliding_total.yaml` *(fails: Could not find a version that satisfies the requirement pre-commit)*

https://chatgpt.com/codex/tasks/task_e_68916b50a600832789d90019a3188e1b

dlwh and others added 30 commits on April 3, 2025 at 21:55
## Summary
- add `eval_sliding_lm.py` for evaluation of cached single-turn chat
datasets
- document how to use this evaluation script

## Testing
- `ruff check src/levanter/main/eval_sliding_lm.py`
- `pytest -k none` *(fails: Could not install dependencies / huggingface
network blocked)*

------
https://chatgpt.com/codex/tasks/task_e_6858e07113ec8327bc506e21be5f5efc
## Summary
- allow specifying HF dataset IDs in evaluation scripts by using
`SingleDatasetLMConfig`
- update `eval_lm.py`, `eval_sliding_lm.py`, `viz_logprobs.py`, and
`lora_lm.py`

## Testing
- `black` formatting
- `isort` import ordering
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*

------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- support `initialize_from_hf` and `use_hf_model_config` in evaluation
scripts
- document using `eval_sliding_lm.py` with HF checkpoints
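The HF-checkpoint path can be sketched as a config fragment. `initialize_from_hf` and `use_hf_model_config` are the options named in this summary; the model id, `type` value, and surrounding layout are placeholders, not the exact contents of the bundled configs:

```yaml
# Illustrative fragment only — the repo id and surrounding keys are assumptions.
model:
  type: llama
initialize_from_hf: "meta-llama/Llama-3.1-8B"  # load weights from an HF checkpoint
use_hf_model_config: true                      # take architecture hyperparameters from the HF config
```

With `use_hf_model_config: true`, the evaluation script does not need the model's hyperparameters restated in the YAML; they are read from the checkpoint's own config.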

## Testing
- `black src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `isort src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*


------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- support `initialize_from_hf` and `use_hf_model_config` in evaluation
scripts
- document using `eval_sliding_lm.py` with HF checkpoints
- update sliding evaluation config to load dataset from GCS

## Testing
- `black src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `isort src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*

------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- handle scalar and 1-D arrays returned by `compute_log_probs`
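The shape-normalization fix can be sketched like this. It is a minimal illustration of the idea, not the PR's exact code, and numpy stands in for the arrays actually returned:

```python
import numpy as np


def as_window_log_probs(out) -> np.ndarray:
    """`compute_log_probs` may return a scalar (a single window) or a
    1-D array (a batch of windows); normalize both shapes to a 1-D
    float array so downstream plotting code has one case to handle."""
    return np.atleast_1d(np.asarray(out, dtype=np.float64))
```

`np.atleast_1d` leaves 1-D input untouched and promotes a 0-D scalar to shape `(1,)`, which is why it fits this two-case situation exactly.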

## Testing
- `ruff check .` *(fails: `F401 levanter.analysis imported but unused`
and other errors)*
- `pytest -q` *(fails: `ModuleNotFoundError: No module named
'tensorboardX'`)*

------
https://chatgpt.com/codex/tasks/task_e_68599c29706483278cffaffce8f442ef
Merge branch 'books3' of github.com:stanford-crfm/levanter into books3
## Summary
- integrate DataLoader into `eval_careless_lm.py`
- compute per-window probabilities on device
- add `use_dataloader` option documenting the tradeoff
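Computing per-window probabilities on device amounts to replacing a host-side loop over windows with one vectorized gather-and-reduce. A minimal sketch, with numpy standing in for the on-device arrays and all names chosen for illustration:

```python
import numpy as np


def per_window_log_probs(token_logprobs: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Sum per-token log-probs over sliding windows in one vectorized
    gather, so the reduction can stay on device instead of looping on
    the host. Sketch only, not the PR's exact implementation."""
    starts = np.arange(0, len(token_logprobs) - window + 1, stride)
    # (n_windows, window) index matrix: row i selects window i's tokens.
    idx = starts[:, None] + np.arange(window)[None, :]
    return token_logprobs[idx].sum(axis=-1)
```

This is also the shape of the `use_dataloader` tradeoff: batching windows through a DataLoader amortizes dispatch overhead, at the cost of materializing the `(n_windows, window)` gather.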

## Testing
- `black src/levanter/main/eval_careless_lm.py`
- `pytest -q` *(fails: ModuleNotFoundError: No module named 'torch')*


------
https://chatgpt.com/codex/tasks/task_e_68841b9691448327bde32d2d8d99cfb0
## Summary
- allow token-based analysis for eval_careless_lm
- implement utilities for sliding windows over token ids
- update Gatsby example config to show token_mode usage
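The sliding-window utility over token ids can be sketched as a small generator. This is an illustration of the idea only; the real helper lives in `src/levanter/books/util.py` and its signature may differ:

```python
def token_windows(token_ids, window: int, stride: int):
    """Yield (start_index, window_of_ids) pairs over a token-id
    sequence — a minimal sketch of token_mode's sliding windows.
    The final partial window (if any) is dropped."""
    for start in range(0, len(token_ids) - window + 1, stride):
        yield start, token_ids[start : start + window]
```

Operating on token ids rather than raw text keeps window boundaries aligned with the model's tokenization, which is what `token_mode` in the Gatsby example config selects.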

## Testing
- `pre-commit run --files src/levanter/books/util.py src/levanter/main/eval_careless_lm.py config/books/eval_careless_llama3.1_70b_gatsby.yaml` *(fails: command not found)*
- `pytest -k histogram -q` *(fails: ImportError: cannot import name 'PositionalSharding')*

------
https://chatgpt.com/codex/tasks/task_e_688bfac097388327a23154a5a6f999b1
Base automatically changed from books3 to memorize September 11, 2025 23:43