Config Generation Progress:
- [ ] Step 8: Run the evaluation
```

**Note on asking questions**: Throughout this workflow, whenever you need to ask the user a question, use the `AskUserQuestion` tool if your environment provides it (e.g. Claude Code). Otherwise, ask in chat and wait for the user's response before proceeding.

**Step 1: Check if nel is installed**

Test that `nel` is installed with `nel --version`.
If not, instruct the user to `pip install nemo-evaluator-launcher`.

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Ask questions one at a time sequentially using AskUserQuestion if available, otherwise ask in chat. AskUserQuestion has a hard limit of 4 options per question — for questions with more options, show the top 3 most common options plus a "Let's chat about it" option. If the user selects "Let's chat about it", ask them in chat to clarify their choice from the full list before proceeding.

1. Execution — use AskUserQuestion if available:
- Local
- SLURM
2. Deployment — use AskUserQuestion if available, show top 3 + "Let's chat about it":
- None (External)
- vLLM
- SGLang
- Let's chat about it *(for NIM, TRT-LLM, or other)*
Full option list if user selects "Let's chat about it": None (External), vLLM, SGLang, NIM, TRT-LLM
3. Auto-export — use AskUserQuestion if available:
- None (auto-export disabled)
- MLflow
- wandb
4. Model type — use AskUserQuestion if available:
- Base
- Chat
- Reasoning
5. Benchmarks — use AskUserQuestion if available (multi-select), show top 3 + "Let's chat about it":
- Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
- Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
- Math & Reasoning (like AIME, GPQA, MATH-500, ...)
- Let's chat about it *(for Safety & Security, Multilingual, or combinations)*
Full option list if user selects "Let's chat about it": Standard LLM Benchmarks, Code Evaluation, Math & Reasoning, Safety & Security (Garak, Safety Harness), Multilingual (MMATH, Global MMLU, MMLU-Prox)

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
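To make the outcome concrete, here is a hedged sketch of a base config assembled from one possible set of answers. Apart from the keys this workflow already names (`execution`, `deployment`, `checkpoint_path`, `hf_model_handle`, `tasks`), the field names below are assumptions, not the launcher's actual schema; `???` marks values resolved in later steps.

```yaml
# Illustrative base config only; the real template comes from nel itself.
execution:
  type: local                # Q1: Local
deployment:
  type: vllm                 # Q2: vLLM
  checkpoint_path: ???       # resolved in Step 3
  hf_model_handle: ???       # resolved in Step 3
export:
  type: mlflow               # Q3: MLflow (omit if auto-export is disabled)
tasks: []                    # Q5 benchmark groups, confirmed in Step 5
```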

It never overwrites existing files.

**Step 3: Configure model path and parameters**

Use AskUserQuestion if available to ask for the model path, otherwise ask in chat. Determine type:

- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
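For example, the two outcomes of this rule look like the following (the path and handle are illustrative):

```yaml
# User gave a local checkpoint path:
deployment:
  checkpoint_path: /models/my-checkpoint   # illustrative path
  hf_model_handle: null

# User gave a HF handle instead:
# deployment:
#   checkpoint_path: null
#   hf_model_handle: org/model-name
```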
Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read it carefully for:

- Sampling params (`temperature`, `top_p`)
- Context length (`deployment.extra_args: "--max-model-len <value>"`)
- TP/DP settings (to set them appropriately, use AskUserQuestion if available to ask how many GPUs the model will be deployed on, otherwise ask in chat)
- Reasoning config (if applicable):
- reasoning on/off: use either:
- `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
- `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
- reasoning effort/budget (if it's configurable, use AskUserQuestion if available to ask what reasoning effort they want, otherwise ask in chat)
- higher `max_new_tokens`
- etc.
- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
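As an illustration, the two mutually exclusive reasoning-toggle styles described above could look like this (the prompt string and kwargs values are examples from the text, not defaults):

```yaml
# Style A: system-prompt toggle; no reasoning-related params_to_add.
adapter_config:
  custom_system_prompt: "/no_think"

# Style B: payload-modifier toggle; no custom system prompt.
# adapter_config:
#   use_system_prompt: false
#   params_to_add:
#     chat_template_kwargs:
#       enable_thinking: false
```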
Present findings, explain each setting, ask user to confirm or adjust. If no mod
**Step 4: Fill in remaining missing values**

- Find all remaining `???` missing values in the config.
- Use AskUserQuestion if available to ask for each missing value (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI), otherwise ask in chat. Don't propose any defaults here. Let the user give you the values in plain text.
- Use AskUserQuestion if available (otherwise ask in chat) to ask if they want to change any other defaults, e.g. execution partition or walltime (if running on SLURM), or add MLflow/wandb tags (if auto-export enabled).
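For instance, values that typically cannot be auto-discovered might still look like this before the user supplies them (keys other than the `???` convention are illustrative):

```yaml
execution:
  hostname: ???     # e.g. a SLURM login node the user names
  account: ???      # the user's SLURM account
  output_dir: ???   # where results should be written
```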

**Step 5: Confirm tasks (iterative)**

Show tasks in the current config. Loop until the user confirms the task list is final:

1. Tell the user: "Run `nel ls tasks` to see all available tasks".
2. Use AskUserQuestion if available to ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides, otherwise ask in chat.
Add per-task `nemo_evaluator_config` overrides as specified by the user, e.g.:
```yaml
tasks:
...
```
3. Apply changes.
4. Show the updated list and ask (via AskUserQuestion if available, otherwise in chat): "Is the task list final, or do you want to make more changes?"
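A hedged sketch of what a confirmed task list with one per-task override might look like (the nesting under `nemo_evaluator_config` is assumed for illustration; follow the structure the user specifies):

```yaml
tasks:
  - name: mmlu
  - name: gsm8k
    nemo_evaluator_config:   # per-task override, shape assumed
      config:
        params:
          max_new_tokens: 2048
```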

**Known Issues**


Only if the model is >120B parameters, suggest multi-node. Explain: "This is DP multi-node: the weights are copied (not distributed) across nodes. One deployment instance per node will be run, with HAProxy load-balancing requests."

Use AskUserQuestion if available to ask if the user wants multi-node and, if yes, for the node count. Otherwise ask in chat. Then configure:

```yaml
execution:
  ...
```