EleutherAI / lm-evaluation-harness Public


Issues: EleutherAI/lm-evaluation-harness

reproduce llama 3 evals

#2557 opened Dec 10, 2024 by baberabb

Open 6


386 Open 894 Closed


Issues list

mmlu_pro bug in fewshot + chat_template

#2780 opened Mar 9, 2025 by Moreh-LeeJunhyeok

CUDA Out of Memory

#2779 opened Mar 9, 2025 by seermer

Multi-NPU evaluation supported?

#2776 opened Mar 7, 2025 by yuhkalhic

Evaluating Pretrained LM always need few-shot example?

#2775 opened Mar 7, 2025 by SpeeeedLee

Deviation in HumanEval Benchmark Results

#2774 opened Mar 7, 2025 by ds-anik

MMLU COT Giving less accuracy

#2770 opened Mar 7, 2025 by Rajshree-Sahu

Layer-by-layer inference evaluation of large models.

#2767 opened Mar 6, 2025 by LLIKKE

Add AIME 2024 and LiveCodeBenchmark to the gold standard evaluation harness

Labels: feature request, help wanted

#2766 opened Mar 6, 2025 by Allen-labs

cluade sonnet 3.5 and 3.7 humaneval 0%

#2764 opened Mar 6, 2025 by whitepaper82

HF batch_size=auto unreliable

Labels: bug, feature request

#2758 opened Mar 4, 2025 by ds-anik

API Model: Custom handling of refused prompt

#2756 opened Mar 4, 2025 by jonoillar

'NoneType' object is not callable!

#2752 opened Mar 3, 2025 by tensorflowt

Smooth landing errors during post processing

#2751 opened Feb 28, 2025 by ksurya

Embedding checkpoint size mismatch when using peft on DeepSeek-R1-Distill-Qwen-1.5B.

#2748 opened Feb 28, 2025 by Phoenix-Shen

Gemini Support and usage

#2747 opened Feb 27, 2025 by IsraelAbebe

HOW TO ADD NEW TASK?

#2745 opened Feb 27, 2025 by amdslgl

modelscope installed will lead some problems

#2744 opened Feb 27, 2025 by jijivski

Error loading MMLU 'prehistory' config: BuilderConfig not found (available: ['default'])

#2743 opened Feb 27, 2025 by ruio248

Creating a new task with data in chat format (openai)

#2741 opened Feb 26, 2025 by leandermaben

An error occurred: 'choices' (in openai chat completion)

#2740 opened Feb 26, 2025 by Raghadalr02

Issue with the Tokenizer of Pixtral-12B-2409

#2731 opened Feb 24, 2025 by aminfarajian

Batching and generate_until special tokens

#2723 opened Feb 21, 2025 by sjmielke

Get acc_norm for HF models in log_samples

Labels: feature request

#2722 opened Feb 21, 2025 by Kartik21

How to preprocess a document with the assistance of a tokenizer from a specific Model

#2717 opened Feb 20, 2025 by p1nksnow

Different models on same tasks gives same results when cache is active

Labels: bug

#2715 opened Feb 19, 2025 by salvatore-cipolla
