Config Generation Progress:
- [ ] Step 8: Run the evaluation
```

**Note on asking questions**: Throughout this workflow, whenever you need to ask the user a question, use the `AskUserQuestion` tool if your environment provides it (e.g. Claude Code). Otherwise, ask in chat and wait for the user's response before proceeding.

**Step 1: Check if nel is installed**

Test that `nel` is installed with `nel --version`.
If not, instruct the user to `pip install nemo-evaluator-launcher`.

**Step 2: Build the base config file**

Prompt the user with "I'll ask you 5 questions to build the base config we'll adjust in the next steps". Ask questions one at a time sequentially using AskUserQuestion if available, otherwise ask in chat. AskUserQuestion has a hard limit of 4 options per question — for questions with more options, show the top 3 most common options plus a "Let's chat about it" option. If the user selects "Let's chat about it", ask them in chat to clarify their choice from the full list before proceeding.

1. Execution — use AskUserQuestion if available:
- Local
- SLURM
2. Deployment — use AskUserQuestion if available, show top 3 + "Let's chat about it":
- None (External)
- vLLM
- SGLang
- Let's chat about it *(for NIM, TRT-LLM, or other)*
Full option list if user selects "Let's chat about it": None (External), vLLM, SGLang, NIM, TRT-LLM
3. Auto-export — use AskUserQuestion if available:
- None (auto-export disabled)
- MLflow
- wandb
4. Model type — use AskUserQuestion if available:
- Base
- Chat
- Reasoning
5. Benchmarks — use AskUserQuestion if available (multi-select), show top 3 + "Let's chat about it":
- Standard LLM Benchmarks (like MMLU, IFEval, GSM8K, ...)
- Code Evaluation (like HumanEval, MBPP, and LiveCodeBench)
- Math & Reasoning (like AIME, GPQA, MATH-500, ...)
- Let's chat about it *(for Safety & Security, Multilingual, or combinations)*
Full option list if user selects "Let's chat about it": Standard LLM Benchmarks, Code Evaluation, Math & Reasoning, Safety & Security (Garak, Safety Harness), Multilingual (MMATH, Global MMLU, MMLU-Prox)

DON'T ALLOW FOR ANY OTHER OPTIONS, only the ones listed above under each category (Execution, Deployment, Auto-export, Model type, Benchmarks). YOU HAVE TO GATHER THE ANSWERS for the 5 questions before you can build the base config.
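To make the outcome concrete, here is a hedged sketch of a base config assembled from one possible set of answers. Apart from the keys this workflow already names (`execution`, `deployment`, `checkpoint_path`, `hf_model_handle`, `tasks`), the field names below are assumptions, not the launcher's actual schema; `???` marks values resolved in later steps.

```yaml
# Illustrative base config only; the real template comes from nel itself.
execution:
  type: local                # Q1: Local
deployment:
  type: vllm                 # Q2: vLLM
  checkpoint_path: ???       # resolved in Step 3
  hf_model_handle: ???       # resolved in Step 3
export:
  type: mlflow               # Q3: MLflow (omit if auto-export is disabled)
tasks: []                    # Q5 benchmark groups, confirmed in Step 5
```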

It never overwrites existing files.

**Step 3: Configure model path and parameters**

Use AskUserQuestion if available to ask for the model path, otherwise ask in chat. Determine type:

- Checkpoint path (starts with `/` or `./`) → set `deployment.checkpoint_path: <path>` and `deployment.hf_model_handle: null`
- HF handle (e.g., `org/model-name`) → set `deployment.hf_model_handle: <handle>` and `deployment.checkpoint_path: null`
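For example, the two outcomes of this rule look like the following (the path and handle are illustrative):

```yaml
# User gave a local checkpoint path:
deployment:
  checkpoint_path: /models/my-checkpoint   # illustrative path
  hf_model_handle: null

# User gave a HF handle instead:
# deployment:
#   checkpoint_path: null
#   hf_model_handle: org/model-name
```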
Use WebSearch to find the model card (HuggingFace, build.nvidia.com). Read it carefully for:

- Sampling params (`temperature`, `top_p`)
- Context length (`deployment.extra_args: "--max-model-len <value>"`)
- TP/DP settings (to set them appropriately, use AskUserQuestion if available to ask how many GPUs the model will be deployed on, otherwise ask in chat)
- Reasoning config (if applicable):
- reasoning on/off: use either:
- `adapter_config.custom_system_prompt` (like `/think`, `/no_think`) and no `adapter_config.params_to_add` (leave `params_to_add` unrelated to reasoning untouched)
- `adapter_config.params_to_add` for payload modifier (like `"chat_template_kwargs": {"enable_thinking": true/false}`) and no `adapter_config.custom_system_prompt` and `adapter_config.use_system_prompt: false` (leave `custom_system_prompt` and `use_system_prompt` unrelated to reasoning untouched).
- reasoning effort/budget (if it's configurable, use AskUserQuestion if available to ask what reasoning effort they want, otherwise ask in chat)
- higher `max_new_tokens`
- etc.
- Deployment-specific `extra_args` for vLLM/SGLang (look for the vLLM/SGLang deployment command)
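As an illustration, the two mutually exclusive reasoning-toggle styles described above could look like this (the prompt string and kwargs values are examples from the text, not defaults):

```yaml
# Style A: system-prompt toggle; no reasoning-related params_to_add.
adapter_config:
  custom_system_prompt: "/no_think"

# Style B: payload-modifier toggle; no custom system prompt.
# adapter_config:
#   use_system_prompt: false
#   params_to_add:
#     chat_template_kwargs:
#       enable_thinking: false
```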
Present findings, explain each setting, ask user to confirm or adjust. If no mod
**Step 4: Fill in remaining missing values**

- Find all remaining `???` missing values in the config.
- Use AskUserQuestion if available to ask for each missing value (e.g., SLURM hostname, account, output directory, MLflow/wandb tracking URI), otherwise ask in chat. Don't propose any defaults here. Let the user give you the values in plain text.
- Use AskUserQuestion if available (otherwise ask in chat) to ask if they want to change any other defaults, e.g. execution partition or walltime (if running on SLURM), or add MLflow/wandb tags (if auto-export enabled).
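For instance, values that typically cannot be auto-discovered might still look like this before the user supplies them (keys other than the `???` convention are illustrative):

```yaml
execution:
  hostname: ???     # e.g. a SLURM login node the user names
  account: ???      # the user's SLURM account
  output_dir: ???   # where results should be written
```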

**Step 5: Confirm tasks (iterative)**

Show tasks in the current config. Loop until the user confirms the task list is final:

1. Tell the user: "Run `nel ls tasks` to see all available tasks".
2. Use AskUserQuestion if available to ask if they want to add/remove tasks or add/remove/modify task-specific parameter overrides, otherwise ask in chat.
Add per-task `nemo_evaluator_config` overrides as specified by the user, e.g.:
```yaml
tasks:
...
```
3. Apply changes.
4. Show the updated list and ask (via AskUserQuestion if available, otherwise in chat): "Is the task list final, or do you want to make more changes?"
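A hedged sketch of what a confirmed task list with one per-task override might look like (the nesting under `nemo_evaluator_config` is assumed for illustration; follow the structure the user specifies):

```yaml
tasks:
  - name: mmlu
  - name: gsm8k
    nemo_evaluator_config:   # per-task override, shape assumed
      config:
        params:
          max_new_tokens: 2048
```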

**Known Issues**


Only if the model is >120B parameters, suggest multi-node. Explain: "This is DP multi-node: the weights are copied (not distributed) across nodes. One deployment instance per node will be run, with HAProxy load-balancing requests."

Use AskUserQuestion if available to ask if the user wants multi-node and, if yes, for the node count. Otherwise ask in chat. Then configure:

```yaml
execution:
  ...
```