feat: add fr_FR locale to nemotron personas datasets #468

Open: johnnygreco wants to merge 4 commits into main from add-fr-fr-locale

Conversation

@johnnygreco
Summary

  • Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES, which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonaRepository, PersonSamplerParams validation, and the download service
  • Add 7 France-specific PII fields to dataset_based_person_fields.py: first_name_heritage, name_heritage, is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement
  • Update person sampling docs with fr_FR locale listing, NGC download example, and field reference
  • Update persona repository tests for 8 locales
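The propagation described in the first bullet can be pictured with a minimal sketch. It assumes the derived constants are computed directly from the dictionary keys; the real definitions in constants.py may differ. Only the fr_FR size comes from this PR, so the other entries use placeholder sizes, and only locales named elsewhere in this conversation are shown.

```python
# Hypothetical stand-in for constants.py, not the actual module contents.
# Only the fr_FR size is stated in this PR; the other sizes are placeholders.
NEMOTRON_PERSONAS_DATASET_SIZES = {
    "en_SG": "<size>",   # placeholder
    "fr_FR": "2.71 GB",  # the entry added by this PR
    "pt_BR": "<size>",   # placeholder
}

# Assumed derivation: the locale list is simply the dictionary's keys,
# so a new entry above is picked up without any further registration.
LOCALES_WITH_MANAGED_DATASETS = list(NEMOTRON_PERSONAS_DATASET_SIZES.keys())

# Comma-joined form, defined once for reuse in help text and error messages.
LOCALES_WITH_MANAGED_DATASETS_STR = ", ".join(LOCALES_WITH_MANAGED_DATASETS)
```

Under this assumption, adding a locale is a one-line change to the dictionary, and every consumer of the two derived constants sees it automatically.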

Register the France locale (fr_FR, 2.71 GB) in NEMOTRON_PERSONAS_DATASET_SIZES
and add 7 France-specific PII fields: first_name_heritage, name_heritage,
is_first_gen_immigrant, household_type, monthly_income_eur, commune, departement.
johnnygreco requested a review from a team as a code owner on March 25, 2026 at 20:52
greptile-apps bot commented Mar 25, 2026

Greptile Summary

This PR registers the fr_FR (France) locale in the Nemotron Personas dataset ecosystem and adds 7 France-specific PII fields, completing the full integration path from configuration to CLI to documentation.

Key changes:

  • NEMOTRON_PERSONAS_DATASET_SIZES in constants.py gets "fr_FR": "2.71 GB", which auto-propagates to LOCALES_WITH_MANAGED_DATASETS, PersonSamplerParams validation, the download service, and the persona repository. By design, no other registration steps are needed.
  • A new LOCALES_WITH_MANAGED_DATASETS_STR constant is introduced as a bonus clean-up, replacing two inline ', '.join(...) calls and fixing a pre-existing staleness bug in the CLI --locale help text (which previously omitted en_SG and pt_BR).
  • Seven France-specific PII fields (commune, departement, household_type, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant) are appended to PII_FIELDS in locale-alphabetical order.
  • All four affected test files are updated with the new locale count (7 → 8) and explicit fr_FR assertions.
  • Documentation covers the NGC download snippet, field reference, and supported-locale parameter table.
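As an illustration of how a shared, pre-joined locale string keeps a validation error message in sync, here is a minimal sketch. The function `validate_locale` and the exact message wording are hypothetical; the actual PersonSamplerParams validator is not shown in this PR conversation, and the locale list below is shortened to locales named here.

```python
# Hypothetical, shortened locale list; the real one is derived in constants.py.
LOCALES_WITH_MANAGED_DATASETS = ["en_SG", "fr_FR", "pt_BR"]
LOCALES_WITH_MANAGED_DATASETS_STR = ", ".join(LOCALES_WITH_MANAGED_DATASETS)


def validate_locale(locale: str) -> str:
    """Return the locale unchanged if it is supported, else raise ValueError.

    Hypothetical helper: it stands in for whatever validation
    PersonSamplerParams performs on its locale field.
    """
    if locale not in LOCALES_WITH_MANAGED_DATASETS:
        raise ValueError(
            f"Unsupported locale {locale!r}. "
            f"Supported locales: {LOCALES_WITH_MANAGED_DATASETS_STR}"
        )
    return locale
```

Because the message is built from the constant rather than a hardcoded list, it cannot drift out of date when a locale such as fr_FR is added.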

Confidence Score: 5/5

  • This PR is safe to merge: it follows the established locale-addition pattern exactly, updates all necessary tests, and improves the CLI help text.
  • The change is mechanical and well-scoped: one dict entry drives the full propagation, the new LOCALES_WITH_MANAGED_DATASETS_STR constant is a clean DRY improvement, all test counts and assertions are updated consistently, and documentation is thorough. No logic changes, no new failure modes.
  • No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/utils/constants.py | Adds fr_FR to NEMOTRON_PERSONAS_DATASET_SIZES in alphabetical order and introduces LOCALES_WITH_MANAGED_DATASETS_STR to DRY up repeated join calls; clean and consistent. |
| packages/data-designer-engine/src/data_designer/engine/sampling_gen/entities/dataset_based_person_fields.py | Adds 7 France-specific PII fields (commune, departement, household_type, monthly_income_eur, first_name_heritage, name_heritage, is_first_gen_immigrant) in the correct locale-ordered position within PII_FIELDS. |
| packages/data-designer-config/src/data_designer/config/sampler_params.py | Replaces two inline ', '.join(LOCALES_WITH_MANAGED_DATASETS) calls with the pre-computed LOCALES_WITH_MANAGED_DATASETS_STR constant; purely mechanical refactor with no logic change. |
| packages/data-designer/src/data_designer/cli/commands/download.py | Replaces a stale hardcoded locale list in the CLI help text (previously missing en_SG and pt_BR) with the dynamic LOCALES_WITH_MANAGED_DATASETS_STR constant, a net improvement over the original. |
| docs/concepts/person_sampling.md | Adds fr_FR to the supported-locales list, NGC download snippet, France-specific field reference table, and the parameter-table locale enum; documentation is comprehensive. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["NEMOTRON_PERSONAS_DATASET_SIZES\n(constants.py)\n+ fr_FR: 2.71 GB"] --> B["LOCALES_WITH_MANAGED_DATASETS\n(list of locale keys)"]
    A --> C["LOCALES_WITH_MANAGED_DATASETS_STR\n(comma-joined string) NEW"]
    B --> D["PersonSamplerParams.locale\nvalidation & field description"]
    B --> E["PersonaRepository._registry\n(locales list)"]
    B --> F["DownloadService\nget_available_locales()"]
    C --> D
    C --> G["CLI --locale help text\n(download.py)"]
    H["dataset_based_person_fields.py\n+ 7 fr_FR PII fields"] --> I["PII_FIELDS list\nused by sampling engine"]

Reviews (4): Last reviewed commit: "refactor: add LOCALES_WITH_MANAGED_DATAS..."

Update hardcoded locale counts from 7 to 8 and add fr_FR assertions
in download controller and download service tests.

The --locale help text was hardcoded and already stale (missing en_SG,
pt_BR, fr_FR). Build it from LOCALES_WITH_MANAGED_DATASETS so it stays
in sync automatically.

Centralise the comma-joined locale list so it is defined once in
constants and reused in the CLI help text, PersonSamplerParams field
description, and locale validation error message.
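The help-text fix described in these commits can be sketched as follows. This is an illustration only: it uses the standard-library argparse as a stand-in, since the CLI framework used by download.py is not shown in this conversation, and `build_parser` and the locale list are hypothetical.

```python
import argparse

# Hypothetical, shortened locale list; the real one is derived in constants.py.
LOCALES_WITH_MANAGED_DATASETS = ["en_SG", "fr_FR", "pt_BR"]
LOCALES_WITH_MANAGED_DATASETS_STR = ", ".join(LOCALES_WITH_MANAGED_DATASETS)


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose --locale help text is generated from the constant."""
    parser = argparse.ArgumentParser(prog="download")
    parser.add_argument(
        "--locale",
        # The help text is built from the shared constant instead of a
        # hardcoded list, so newly registered locales appear automatically.
        help=f"Locale to download. Supported: {LOCALES_WITH_MANAGED_DATASETS_STR}",
    )
    return parser
```

Calling `build_parser().format_help()` then lists every registered locale, which is the property the hardcoded help string had lost.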