2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -9,4 +9,4 @@ examples @akoumpa @HuiyingLi @adil-a @hemildesai @ZhiyuLi-Nvidia
README.md @akoumpa @HuiyingLi @snowmanwwg

nemo_automodel/components/datasets/llm/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
biencoder/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
encoder/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
14 changes: 7 additions & 7 deletions docs/guides/dataset-overview.md
@@ -1,8 +1,8 @@
# Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel

This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (biencoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (encoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Biencoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.
- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Encoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.

- If a dataset you need is missing, please open a [GitHub issue](https://github.com/NVIDIA-NeMo/Automodel/issues) with a short description and example schema so we can prioritize support.
---
@@ -220,9 +220,9 @@ dataset:
```
See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.

### Retrieval/Biencoder (Embedding Fine-Tuning)
### Retrieval/Encoder (Embedding Fine-Tuning)
- Factory: `nemo_automodel.components.datasets.llm.make_retrieval_dataset`
- Collator: `nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator`
- Collator: `nemo_automodel.components.datasets.llm.RetrievalEncoderCollator`
- Use case: embedding model fine-tuning with (query, positive doc, negative docs) contrastive learning
- Supported schemas:
- Corpus-ID JSON (Merlin/NeMo-retriever style)
@@ -233,13 +233,13 @@ dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
data_dir_list: /abs/path/to/train.jsonl
data_type: train
train_n_passages: 5
n_passages: 5
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
```
See the detailed guide, [Biencoder retrieval dataset](llm/retrieval-dataset.md), for more information.
See the detailed guide, [Encoder retrieval dataset](llm/retrieval-dataset.md), for more information.
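To make the expected input concrete, here is a minimal sketch of one flat-JSONL training record in the (query, positive doc, negative docs) shape described above. The `pos_doc`/`neg_doc` field names follow the requirements listed in the retrieval dataset guide; the `query` key and all values are illustrative assumptions, not necessarily the loader's exact schema.

```python
import json

# Hypothetical training record: one query, one positive document, and a
# list of negative documents for contrastive learning. Field names are
# assumptions for illustration, not the loader's authoritative schema.
record = {
    "query": "who wrote the iliad",
    "pos_doc": "The Iliad is an ancient Greek epic poem attributed to Homer.",
    "neg_doc": [
        "The Odyssey is one of two major Greek epics attributed to Homer.",
        "Virgil wrote the Aeneid, a Latin epic poem.",
    ],
}

# Each training example is serialized as one line of a .jsonl file.
line = json.dumps(record)
parsed = json.loads(line)
assert parsed["pos_doc"]            # positive document must be non-empty
assert len(parsed["neg_doc"]) >= 1  # at least one negative is expected
```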

- **NanoGPT Binary Shards (pretraining)**
- Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
16 changes: 8 additions & 8 deletions docs/guides/llm/retrieval-dataset.md
@@ -1,10 +1,10 @@
# Biencoder Retrieval Dataset (Embedding Fine-tuning)
# Encoder Retrieval Dataset (Embedding Fine-tuning)

NeMo Automodel supports **biencoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.
NeMo Automodel supports **encoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.

This dataset is used by the biencoder recipes (see `examples/biencoder/`) together with the `RetrievalBiencoderCollator`.
This dataset is used by the encoder recipes (see `examples/encoder/bi_encoder/` and `examples/encoder/cross_encoder/`) together with the `RetrievalEncoderCollator`.

## What the Biencoder Consumes
## What the Encoder Consumes

The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:

@@ -78,7 +78,7 @@ This is convenient for custom fine-tuning pipelines where the documents are incl

## YAML Usage (Dataset + Collator)

Use the dataset factory plus the biencoder collator:
Use the dataset factory plus the encoder collator:

```yaml
dataloader:
@@ -88,11 +88,11 @@ dataloader:
data_dir_list:
- /abs/path/to/train.jsonl # or train.json (corpus-id format)
data_type: train
train_n_passages: 5 # 1 positive + 4 negatives
n_passages: 5 # 1 positive + 4 negatives
do_shuffle: true
use_dataset_instruction: false
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: "query:"
@@ -103,4 +103,4 @@ dataloader:
## Requirements

- `pos_doc` must be **non-empty**.
- If training requests negatives (e.g., `train_n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
- If training requests negatives (e.g., `n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
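The cycling behavior can be sketched as follows. This is a minimal illustration of the documented contract (1 positive + `n_passages - 1` negatives, reusing negatives when fewer are provided), not the loader's actual implementation; the function name and signature are assumptions.

```python
from itertools import cycle, islice

def select_passages(pos_doc: str, neg_docs: list[str], n_passages: int) -> list[str]:
    """Sketch: pick 1 positive plus (n_passages - 1) negatives, cycling
    the negative list when fewer negatives than needed are provided."""
    if not pos_doc:
        raise ValueError("pos_doc must be non-empty")
    if n_passages > 1 and not neg_docs:
        raise ValueError("at least one negative document is required")
    negatives = list(islice(cycle(neg_docs), n_passages - 1)) if n_passages > 1 else []
    return [pos_doc] + negatives

# Two negatives are cycled to satisfy n_passages=5 (1 positive + 4 negatives).
passages = select_passages("positive", ["neg-a", "neg-b"], n_passages=5)
print(passages)  # ['positive', 'neg-a', 'neg-b', 'neg-a', 'neg-b']
```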
@@ -15,10 +15,10 @@
from __future__ import annotations

from nemo_automodel.components.config._arg_parser import parse_args_and_load_config
from nemo_automodel.recipes.biencoder import TrainBiencoderRecipe
from nemo_automodel.recipes.encoder import TrainRetrieverEncoderRecipe


def main(default_config_path="examples/biencoder/llama3_2_1b_biencoder.yaml"):
def main(default_config_path="examples/encoder/bi_encoder/llama3_2_1b.yaml"):
"""Main entry point for the biencoder fine-tuning recipe.

Loads the configuration, sets up the recipe, and initiates the training loop.
@@ -27,7 +27,7 @@ def main(default_config_path="examples/biencoder/llama3_2_1b_biencoder.yaml"):
default_config_path: Path to the default configuration file
"""
cfg = parse_args_and_load_config(default_config_path)
recipe = TrainBiencoderRecipe(cfg)
recipe = TrainRetrieverEncoderRecipe(cfg)
recipe.setup()
recipe.run_train_validation_loop()

@@ -13,14 +13,12 @@
# limitations under the License.

# To run this recipe, please use the following command:
# python examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# python examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama3_2_1b.yaml
# Or with torchrun for multi-GPU:
# torchrun --nproc-per-node=8 examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama3_2_1b.yaml

seed: 42

train_n_passages: 5
eval_negative_size: 4
temperature: 0.02

step_scheduler:
@@ -35,9 +33,8 @@ dist_env:
timeout_minutes: 1

model:
_target_: nemo_automodel.NeMoAutoModelBiencoder.from_pretrained
_target_: nemo_automodel.NeMoAutoModelBiEncoder.from_pretrained
pretrained_model_name_or_path: meta-llama/Llama-3.2-1B
share_encoder: true
pooling: avg
l2_normalize: true
use_liger_kernel: true
@@ -52,16 +49,16 @@ dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
model_type: biencoder
data_dir_list:
- hf://nvidia/embed-nemotron-dataset-v1/FEVER
- hf://nvidia/embed-nemotron-dataset-v1/SyntheticClassificationData
data_type: train
train_n_passages: 5
eval_negative_size: 4
n_passages: 5
seed: 42
do_shuffle: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: "query:"
@@ -75,16 +72,16 @@ dataloader:
# _target_: torchdata.stateful_dataloader.StatefulDataLoader
# dataset:
# _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
# model_type: biencoder
# data_dir_list: training_datasets/validation.json
# data_type: eval
# train_n_passages: 5
# eval_negative_size: 4
# n_passages: 5
# seed: 42
# do_shuffle: false
# max_train_samples: 1000
# train_data_select_offset: 0
# collate_fn:
# _target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
# _target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
# q_max_len: 512
# p_max_len: 512
# query_prefix: "query:"
@@ -121,13 +118,12 @@ optimizer:
bias_correction: true
master_weights: true

# Learning rate scheduler
lr_scheduler:
lr_warmup_steps: 100

checkpoint:
enabled: true
checkpoint_dir: ./output/llama3_2_1b_biencoder/checkpoints
checkpoint_dir: ./output/llama3_2_1b_encoder/checkpoints
model_save_format: safetensors
save_consolidated: true

@@ -13,7 +13,7 @@ This guide provides step-by-step instructions to reproduce the training pipeline
Download and prepare the `nvidia/embed-nemotron-dataset-v1` dataset from [Hugging Face](https://huggingface.co/datasets/nvidia/embed-nemotron-dataset-v1). This dataset is a selected subset of the fine-tuning data used for training the `llama-embed-nemotron-8b` model:

```bash
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py \
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
```

@@ -24,8 +24,8 @@ This script will download the dataset and prepare it for training.
Run the model finetuning with the specified configuration using 8 GPUs:

```bash
torchrun --nproc-per-node=8 examples/biencoder/finetune.py \
--config examples/biencoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py \
--config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
```

The final model checkpoint in Hugging Face format will be stored in `output/llama_embed_nemotron_8b/epoch_0_step_28614/model/consolidated`.
@@ -16,10 +16,10 @@
Downloads and restores datasets from HuggingFace repositories.

Usage:
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py --download-path path/to/dataset
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py --download-path path/to/dataset

Example:
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py \
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
"""

@@ -13,13 +13,12 @@
# limitations under the License.

# To run this recipe, please use the following command:
# python examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# python examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
# Or with torchrun for multi-GPU:
# torchrun --nproc-per-node=8 examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml

seed: 125

train_n_passages: 5
temperature: 0.02

step_scheduler:
@@ -30,7 +29,7 @@ step_scheduler:
num_epochs: 1

model:
_target_: nemo_automodel.NeMoAutoModelBiencoder.from_pretrained
_target_: nemo_automodel.NeMoAutoModelBiEncoder.from_pretrained
pretrained_model_name_or_path: meta-llama/Llama-3.1-8B
pooling: avg
torch_dtype: bfloat16
@@ -43,6 +42,7 @@ dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
model_type: biencoder
data_dir_list:
- ./embed_nemotron_dataset_v1/EmotionClassification/EmotionClassification.json
- ./embed_nemotron_dataset_v1/FEVER/FEVER.json
@@ -59,12 +59,12 @@ dataloader:
- ./embed_nemotron_dataset_v1/SyntheticClassificationData/SyntheticClassificationData.json
- ./embed_nemotron_dataset_v1/TriviaQA/TriviaQA.json
data_type: train
train_n_passages: 5
n_passages: 5
seed: 125
do_shuffle: true
use_dataset_instruction: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: ""
29 changes: 29 additions & 0 deletions examples/encoder/cross_encoder/finetune.py
@@ -0,0 +1,29 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import annotations

from nemo_automodel.components.config._arg_parser import parse_args_and_load_config
from nemo_automodel.recipes.encoder import TrainCrossEncoderRecipe


def main(default_config_path="examples/encoder/cross_encoder/llama3_2_1b.yaml"):
cfg = parse_args_and_load_config(default_config_path)
recipe = TrainCrossEncoderRecipe(cfg)
recipe.setup()
recipe.run_train_validation_loop()


if __name__ == "__main__":
main()