
Add multi-book sliding evaluation #1121

Open

ahmeda14960 wants to merge 53 commits into memorize from codex/explain-careless-eval-lm-and-fsdp-handling

Conversation

@ahmeda14960
Contributor

Summary

  • Add `eval_sliding_total.py`, a driver that runs the careless suffix-likelihood evaluation over many books from a single config
  • Derive default output plot/data filenames from `book_title` when they are not specified
  • Include a YAML config covering all bundled books, with automatic per-book subdirectories on GCS

Testing

  • `pytest tests/test_background_iterable.py -k test_reentrancy -q` *(fails: ImportError: cannot import name 'PositionalSharding' from 'jax.sharding')*
  • `pip install --quiet "jax==0.4.26" "jaxlib==0.4.26" -f https://storage.googleapis.com/jax-releases/jax_releases.html` *(fails: Could not find a version that satisfies the requirement jax==0.4.26)*
  • `pip install --quiet pre-commit && pre-commit run --files src/levanter/main/eval_careless_lm.py src/levanter/main/eval_sliding_total.py config/books/eval_sliding_total.yaml` *(fails: Could not find a version that satisfies the requirement pre-commit)*

https://chatgpt.com/codex/tasks/task_e_68916b50a600832789d90019a3188e1b

dlwh and others added 30 commits on April 3, 2025 at 21:55
## Summary
- add `eval_sliding_lm.py` for evaluation of cached single-turn chat
datasets
- document how to use this evaluation script

## Testing
- `ruff check src/levanter/main/eval_sliding_lm.py`
- `pytest -k none` *(fails: Could not install dependencies / huggingface
network blocked)*

------
https://chatgpt.com/codex/tasks/task_e_6858e07113ec8327bc506e21be5f5efc
## Summary
- allow specifying HF dataset IDs in evaluation scripts by using
`SingleDatasetLMConfig`
- update `eval_lm.py`, `eval_sliding_lm.py`, `viz_logprobs.py`, and
`lora_lm.py`

## Testing
- `black` formatting
- `isort` import ordering
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*

------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- support `initialize_from_hf` and `use_hf_model_config` in evaluation
scripts
- document using `eval_sliding_lm.py` with HF checkpoints
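The HF-checkpoint path can be sketched as a config fragment. `initialize_from_hf` and `use_hf_model_config` are the options named in this summary; the model id, `type` value, and surrounding layout are placeholders, not the exact contents of the bundled configs:

```yaml
# Illustrative fragment only — the repo id and surrounding keys are assumptions.
model:
  type: llama
initialize_from_hf: "meta-llama/Llama-3.1-8B"  # load weights from an HF checkpoint
use_hf_model_config: true                      # take architecture hyperparameters from the HF config
```

With `use_hf_model_config: true`, the evaluation script does not need the model's hyperparameters restated in the YAML; they are read from the checkpoint's own config.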

## Testing
- `black src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `isort src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*


------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- support `initialize_from_hf` and `use_hf_model_config` in evaluation
scripts
- document using `eval_sliding_lm.py` with HF checkpoints
- update sliding evaluation config to load dataset from GCS

## Testing
- `black src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `isort src/levanter/main/eval_lm.py src/levanter/main/eval_sliding_lm.py`
- `pytest -k "eval_lm or viz_lm" -q` *(fails: ModuleNotFoundError:
tensorboardX)*

------
https://chatgpt.com/codex/tasks/task_e_6858da279ef88327a9f11a4d206dab07
## Summary
- handle scalar and 1-D arrays returned by `compute_log_probs`
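The shape-normalization fix can be sketched like this. It is a minimal illustration of the idea, not the PR's exact code, and numpy stands in for the arrays actually returned:

```python
import numpy as np


def as_window_log_probs(out) -> np.ndarray:
    """`compute_log_probs` may return a scalar (a single window) or a
    1-D array (a batch of windows); normalize both shapes to a 1-D
    float array so downstream plotting code has one case to handle."""
    return np.atleast_1d(np.asarray(out, dtype=np.float64))
```

`np.atleast_1d` leaves 1-D input untouched and promotes a 0-D scalar to shape `(1,)`, which is why it fits this two-case situation exactly.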

## Testing
- `ruff check .` *(fails: `F401 levanter.analysis imported but unused`
and other errors)*
- `pytest -q` *(fails: `ModuleNotFoundError: No module named
'tensorboardX'`)*

------
https://chatgpt.com/codex/tasks/task_e_68599c29706483278cffaffce8f442ef
Merge branch 'books3' of github.com:stanford-crfm/levanter into books3
## Summary
- integrate DataLoader into `eval_careless_lm.py`
- compute per-window probabilities on device
- add `use_dataloader` option documenting the tradeoff
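Computing per-window probabilities on device amounts to replacing a host-side loop over windows with one vectorized gather-and-reduce. A minimal sketch, with numpy standing in for the on-device arrays and all names chosen for illustration:

```python
import numpy as np


def per_window_log_probs(token_logprobs: np.ndarray, window: int, stride: int = 1) -> np.ndarray:
    """Sum per-token log-probs over sliding windows in one vectorized
    gather, so the reduction can stay on device instead of looping on
    the host. Sketch only, not the PR's exact implementation."""
    starts = np.arange(0, len(token_logprobs) - window + 1, stride)
    # (n_windows, window) index matrix: row i selects window i's tokens.
    idx = starts[:, None] + np.arange(window)[None, :]
    return token_logprobs[idx].sum(axis=-1)
```

This is also the shape of the `use_dataloader` tradeoff: batching windows through a DataLoader amortizes dispatch overhead, at the cost of materializing the `(n_windows, window)` gather.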

## Testing
- `black src/levanter/main/eval_careless_lm.py`
- `pytest -q` *(fails: ModuleNotFoundError: No module named 'torch')*


------
https://chatgpt.com/codex/tasks/task_e_68841b9691448327bde32d2d8d99cfb0
## Summary
- allow token-based analysis for eval_careless_lm
- implement utilities for sliding windows over token ids
- update Gatsby example config to show token_mode usage
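The sliding-window utility over token ids can be sketched as a small generator. This is an illustration of the idea only; the real helper lives in `src/levanter/books/util.py` and its signature may differ:

```python
def token_windows(token_ids, window: int, stride: int):
    """Yield (start_index, window_of_ids) pairs over a token-id
    sequence — a minimal sketch of token_mode's sliding windows.
    The final partial window (if any) is dropped."""
    for start in range(0, len(token_ids) - window + 1, stride):
        yield start, token_ids[start : start + window]
```

Operating on token ids rather than raw text keeps window boundaries aligned with the model's tokenization, which is what `token_mode` in the Gatsby example config selects.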

## Testing
- `pre-commit run --files src/levanter/books/util.py src/levanter/main/eval_careless_lm.py config/books/eval_careless_llama3.1_70b_gatsby.yaml` *(fails: command not found)*
- `pytest -k histogram -q` *(fails: ImportError: cannot import name 'PositionalSharding')*

------
https://chatgpt.com/codex/tasks/task_e_688bfac097388327a23154a5a6f999b1
Base automatically changed from books3 to memorize September 11, 2025 23:43