2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -9,4 +9,4 @@ examples @akoumpa @HuiyingLi @adil-a @hemildesai @ZhiyuLi-Nvidia
README.md @akoumpa @HuiyingLi @snowmanwwg

nemo_automodel/components/datasets/llm/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
biencoder/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
encoder/ @akoumpa @adil-a @HuiyingLi @hemildesai @NVIDIA-NeMo/automodel_retriever_maintainers
14 changes: 7 additions & 7 deletions docs/guides/dataset-overview.md
@@ -1,8 +1,8 @@
# Dataset Overview: LLM, VLM, and Retrieval Datasets in NeMo Automodel

This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (biencoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.
This page summarizes the datasets supported in NeMo Automodel for LLM, VLM, and retrieval/embedding (encoder) training and shows how to plug in your own datasets using Python functions or the YAML `_target_` mechanism.

- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Biencoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.
- See also: [LLM datasets](llm/dataset.md), [VLM datasets](vlm/dataset.md), and [Encoder retrieval dataset](llm/retrieval-dataset.md) for deeper, task-specific guides.

- If a dataset you need is missing, please open a [GitHub issue](https://github.com/NVIDIA-NeMo/Automodel/issues) with a short description and example schema so we can prioritize support.
---
@@ -220,9 +220,9 @@ dataset:
```
See the [Function Calling guide](llm/toolcalling.md) for an end-to-end example with FunctionGemma.

### Retrieval/Biencoder (Embedding Fine-Tuning)
### Retrieval/Encoder (Embedding Fine-Tuning)
- Factory: `nemo_automodel.components.datasets.llm.make_retrieval_dataset`
- Collator: `nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator`
- Collator: `nemo_automodel.components.datasets.llm.RetrievalEncoderCollator`
- Use case: embedding model fine-tuning with (query, positive doc, negative docs) contrastive learning
- Supported schemas:
- Corpus-ID JSON (Merlin/NeMo-retriever style)
@@ -233,13 +233,13 @@ dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
data_dir_list: /abs/path/to/train.jsonl
data_type: train
train_n_passages: 5
n_passages: 5
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
```
See the detailed guide, [Biencoder retrieval dataset](llm/retrieval-dataset.md), for more information.
See the detailed guide, [Encoder retrieval dataset](llm/retrieval-dataset.md), for more information.
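To make the expected input concrete, here is a minimal sketch of one flat-JSONL training record in the (query, positive doc, negative docs) shape described above. The `pos_doc`/`neg_doc` field names follow the requirements listed in the retrieval dataset guide; the `query` key and all values are illustrative assumptions, not necessarily the loader's exact schema.

```python
import json

# Hypothetical training record: one query, one positive document, and a
# list of negative documents for contrastive learning. Field names are
# assumptions for illustration, not the loader's authoritative schema.
record = {
    "query": "who wrote the iliad",
    "pos_doc": "The Iliad is an ancient Greek epic poem attributed to Homer.",
    "neg_doc": [
        "The Odyssey is one of two major Greek epics attributed to Homer.",
        "Virgil wrote the Aeneid, a Latin epic poem.",
    ],
}

# Each training example is serialized as one line of a .jsonl file.
line = json.dumps(record)
parsed = json.loads(line)
assert parsed["pos_doc"]            # positive document must be non-empty
assert len(parsed["neg_doc"]) >= 1  # at least one negative is expected
```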

- **NanoGPT Binary Shards (pretraining)**
- Class: `nemo_automodel.components.datasets.llm.nanogpt_dataset.NanogptDataset`
16 changes: 8 additions & 8 deletions docs/guides/llm/retrieval-dataset.md
@@ -1,10 +1,10 @@
# Biencoder Retrieval Dataset (Embedding Fine-tuning)
# Encoder Retrieval Dataset (Embedding Fine-tuning)

NeMo Automodel supports **biencoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.
NeMo Automodel supports **encoder/embedding model fine-tuning** using a retrieval-style dataset: each training example is a **query** paired with **one positive** document and **one or more negative** documents.

This dataset is used by the biencoder recipes (see `examples/biencoder/`) together with the `RetrievalBiencoderCollator`.
This dataset is used by the encoder recipes (see `examples/encoder/bi_encoder/` and `examples/encoder/cross_encoder/`) together with the `RetrievalEncoderCollator`.

## What the Biencoder Consumes
## What the Encoder Consumes

The dataset factory `nemo_automodel.components.datasets.llm.make_retrieval_dataset` returns a Hugging Face `datasets.Dataset`. At runtime it transforms each raw record into the training-time schema:

@@ -78,7 +78,7 @@ This is convenient for custom fine-tuning pipelines where the documents are incl

## YAML Usage (Dataset + Collator)

Use the dataset factory plus the biencoder collator:
Use the dataset factory plus the encoder collator:

```yaml
dataloader:
@@ -88,11 +88,11 @@ dataloader:
data_dir_list:
- /abs/path/to/train.jsonl # or train.json (corpus-id format)
data_type: train
train_n_passages: 5 # 1 positive + 4 negatives
n_passages: 5 # 1 positive + 4 negatives
do_shuffle: true
use_dataset_instruction: false
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: "query:"
@@ -103,4 +103,4 @@ dataloader:
## Requirements

- `pos_doc` must be **non-empty**.
- If training requests negatives (e.g., `train_n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
- If training requests negatives (e.g., `n_passages > 1`), `neg_doc` must contain **at least one** document (the loader will cycle negatives if you provide fewer than needed).
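The cycling behavior can be sketched as follows. This is a minimal illustration of the documented contract (1 positive + `n_passages - 1` negatives, reusing negatives when fewer are provided), not the loader's actual implementation; the function name and signature are assumptions.

```python
from itertools import cycle, islice

def select_passages(pos_doc: str, neg_docs: list[str], n_passages: int) -> list[str]:
    """Sketch: pick 1 positive plus (n_passages - 1) negatives, cycling
    the negative list when fewer negatives than needed are provided."""
    if not pos_doc:
        raise ValueError("pos_doc must be non-empty")
    if n_passages > 1 and not neg_docs:
        raise ValueError("at least one negative document is required")
    negatives = list(islice(cycle(neg_docs), n_passages - 1)) if n_passages > 1 else []
    return [pos_doc] + negatives

# Two negatives are cycled to satisfy n_passages=5 (1 positive + 4 negatives).
passages = select_passages("positive", ["neg-a", "neg-b"], n_passages=5)
print(passages)  # ['positive', 'neg-a', 'neg-b', 'neg-a', 'neg-b']
```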
@@ -15,10 +15,10 @@
from __future__ import annotations

from nemo_automodel.components.config._arg_parser import parse_args_and_load_config
from nemo_automodel.recipes.biencoder import TrainBiencoderRecipe
from nemo_automodel.recipes.encoder import TrainRetrieverEncoderRecipe


def main(default_config_path="examples/biencoder/llama3_2_1b_biencoder.yaml"):
def main(default_config_path="examples/encoder/bi_encoder/llama3_2_1b.yaml"):
"""Main entry point for the biencoder fine-tuning recipe.

Loads the configuration, sets up the recipe, and initiates the training loop.
@@ -27,7 +27,7 @@ def main(default_config_path="examples/biencoder/llama3_2_1b_biencoder.yaml"):
default_config_path: Path to the default configuration file
"""
cfg = parse_args_and_load_config(default_config_path)
recipe = TrainBiencoderRecipe(cfg)
recipe = TrainRetrieverEncoderRecipe(cfg)
recipe.setup()
recipe.run_train_validation_loop()

@@ -13,14 +13,12 @@
# limitations under the License.

# To run this recipe, please use the following command:
# python examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# python examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama3_2_1b.yaml
# Or with torchrun for multi-GPU:
# torchrun --nproc-per-node=8 examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama3_2_1b.yaml

seed: 42

train_n_passages: 5
eval_negative_size: 4
temperature: 0.02

step_scheduler:
@@ -35,9 +33,8 @@ dist_env:
timeout_minutes: 1

model:
_target_: nemo_automodel.NeMoAutoModelBiencoder.from_pretrained
_target_: nemo_automodel.NeMoAutoModelBiEncoder.from_pretrained
pretrained_model_name_or_path: meta-llama/Llama-3.2-1B
share_encoder: true
pooling: avg
l2_normalize: true
use_liger_kernel: true
@@ -52,16 +49,16 @@ dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
model_type: biencoder
data_dir_list:
- hf://nvidia/embed-nemotron-dataset-v1/FEVER
- hf://nvidia/embed-nemotron-dataset-v1/SyntheticClassificationData
data_type: train
train_n_passages: 5
eval_negative_size: 4
n_passages: 5
seed: 42
do_shuffle: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: "query:"
@@ -75,16 +72,16 @@ dataloader:
# _target_: torchdata.stateful_dataloader.StatefulDataLoader
# dataset:
# _target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
# model_type: biencoder
# data_dir_list: training_datasets/validation.json
# data_type: eval
# train_n_passages: 5
# eval_negative_size: 4
# n_passages: 5
# seed: 42
# do_shuffle: false
# max_train_samples: 1000
# train_data_select_offset: 0
# collate_fn:
# _target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
# _target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
# q_max_len: 512
# p_max_len: 512
# query_prefix: "query:"
@@ -121,13 +118,12 @@ optimizer:
bias_correction: true
master_weights: true

# Learning rate scheduler
lr_scheduler:
lr_warmup_steps: 100

checkpoint:
enabled: true
checkpoint_dir: ./output/llama3_2_1b_biencoder/checkpoints
checkpoint_dir: ./output/llama3_2_1b_encoder/checkpoints
model_save_format: safetensors
save_consolidated: true

@@ -13,7 +13,7 @@ This guide provides step-by-step instructions to reproduce the training pipeline
Download and prepare the `nvidia/embed-nemotron-dataset-v1` dataset from [Hugging Face](https://huggingface.co/datasets/nvidia/embed-nemotron-dataset-v1). This dataset is a selected subset of the fine-tuning data used for training the `llama-embed-nemotron-8b` model:

```bash
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py \
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
```

@@ -24,8 +24,8 @@ This script will download the dataset and prepare it for training.
Run the model finetuning with the specified configuration using 8 GPUs:

```bash
torchrun --nproc-per-node=8 examples/biencoder/finetune.py \
--config examples/biencoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py \
--config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
```

The final model checkpoint in Hugging Face format will be stored in `output/llama_embed_nemotron_8b/epoch_0_step_28614/model/consolidated`.
@@ -16,10 +16,10 @@
Downloads and restores datasets from HuggingFace repositories.

Usage:
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py --download-path path/to/dataset
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py --download-path path/to/dataset

Example:
python examples/biencoder/llama_embed_nemotron_8b/data_preparation.py \
python examples/encoder/bi_encoder/llama_embed_nemotron_8b/data_preparation.py \
--download-path ./embed_nemotron_dataset_v1
"""

@@ -13,13 +13,12 @@
# limitations under the License.

# To run this recipe, please use the following command:
# python examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# python examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml
# Or with torchrun for multi-GPU:
# torchrun --nproc-per-node=8 examples/biencoder/finetune.py --config examples/biencoder/llama3_2_1b_biencoder.yaml
# torchrun --nproc-per-node=8 examples/encoder/bi_encoder/finetune.py --config examples/encoder/bi_encoder/llama_embed_nemotron_8b/llama_embed_nemotron_8b.yaml

seed: 125

train_n_passages: 5
temperature: 0.02

step_scheduler:
@@ -30,7 +29,7 @@ step_scheduler:
num_epochs: 1

model:
_target_: nemo_automodel.NeMoAutoModelBiencoder.from_pretrained
_target_: nemo_automodel.NeMoAutoModelBiEncoder.from_pretrained
pretrained_model_name_or_path: meta-llama/Llama-3.1-8B
pooling: avg
torch_dtype: bfloat16
@@ -43,6 +42,7 @@ dataloader:
_target_: torchdata.stateful_dataloader.StatefulDataLoader
dataset:
_target_: nemo_automodel.components.datasets.llm.make_retrieval_dataset
model_type: biencoder
data_dir_list:
- ./embed_nemotron_dataset_v1/EmotionClassification/EmotionClassification.json
- ./embed_nemotron_dataset_v1/FEVER/FEVER.json
@@ -59,12 +59,12 @@ dataloader:
- ./embed_nemotron_dataset_v1/SyntheticClassificationData/SyntheticClassificationData.json
- ./embed_nemotron_dataset_v1/TriviaQA/TriviaQA.json
data_type: train
train_n_passages: 5
n_passages: 5
seed: 125
do_shuffle: true
use_dataset_instruction: true
collate_fn:
_target_: nemo_automodel.components.datasets.llm.RetrievalBiencoderCollator
_target_: nemo_automodel.components.datasets.llm.RetrievalEncoderCollator
q_max_len: 512
p_max_len: 512
query_prefix: ""
29 changes: 29 additions & 0 deletions examples/encoder/cross_encoder/finetune.py
@@ -0,0 +1,29 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import annotations

from nemo_automodel.components.config._arg_parser import parse_args_and_load_config
from nemo_automodel.recipes.encoder import TrainCrossEncoderRecipe


def main(default_config_path="examples/encoder/cross_encoder/llama3_2_1b.yaml"):
cfg = parse_args_and_load_config(default_config_path)
recipe = TrainCrossEncoderRecipe(cfg)
recipe.setup()
recipe.run_train_validation_loop()


if __name__ == "__main__":
main()