fix(sde,flowdppo): derive KL sigma_t from each SDE strategy (flow/dance/cps) by Jayce-Ping · Pull Request #38 · Tencent-Hunyuan/UniRL

Jayce-Ping · 2026-06-12T02:08:17Z

Summary

FlowDPPO._compute_sigma_t hardcoded the Flow-SDE std, so the KL normalizer was wrong for Dance and CPS (and, at the first sigma == 1 step, for Flow too). This adds a single source of truth on the SDE strategies -- _std_dev_t + transition_std -- and routes each strategy's step() through them (numerically identical). FlowDPPO._compute_sigma_t now delegates to strategy.transition_std, so the KL (d_mean)^2 / (2 * std^2) uses each strategy's actual transition std: Flow/Dance use std_dev_t * sqrt(-dt), CPS uses std_dev_t (no sqrt(-dt)). This also folds in the earlier fix to Flow's sigma == 1 denominator (sigma_max = sigmas[1] instead of a 0.99 clamp).
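The single-source-of-truth layout described above can be sketched in pure Python. This is a minimal illustration, not the repository's code: the class names `SDEStrategySketch`, `FlowLikeSketch`, and `CPSLikeSketch`, the scalar signatures, and the constant `scale` are all hypothetical. Only the method names `_std_dev_t` and `transition_std`, the `sqrt(-dt)` convention, and the KL form `(d_mean)^2 / (2 * std^2)` come from the summary.

```python
import math
from abc import ABC, abstractmethod


class SDEStrategySketch(ABC):
    """Hypothetical stand-in for an SDE strategy (not the repository's class)."""

    @abstractmethod
    def _std_dev_t(self, sigma: float, sigma_next: float) -> float:
        """Strategy-specific noise scale for one step."""

    def transition_std(self, sigma: float, sigma_next: float) -> float:
        # Flow/Dance convention from the summary: std_dev_t * sqrt(-dt).
        # dt = sigma_next - sigma is negative while denoising.
        dt = sigma_next - sigma
        return self._std_dev_t(sigma, sigma_next) * math.sqrt(-dt)


class FlowLikeSketch(SDEStrategySketch):
    def __init__(self, scale: float) -> None:
        # ASSUMPTION: a constant placeholder instead of the real formula.
        self.scale = scale

    def _std_dev_t(self, sigma: float, sigma_next: float) -> float:
        return self.scale


class CPSLikeSketch(FlowLikeSketch):
    def transition_std(self, sigma: float, sigma_next: float) -> float:
        # CPS convention from the summary: std_dev_t with no sqrt(-dt) factor.
        return self._std_dev_t(sigma, sigma_next)


def kl_per_step(d_mean: float, std: float) -> float:
    """KL normalizer from the summary: (d_mean)^2 / (2 * std^2)."""
    return d_mean ** 2 / (2.0 * std ** 2)
```

With this shape, both the sampling step and the KL would read the std from `transition_std`, so the two cannot drift apart for any one strategy.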

Related Issue

N/A

Test Plan

Pure-Python numeric check (no local torch env): the new transition_std equals each strategy's original step() std at representative sigmas including sigma == 1 for Flow; CPS omits sqrt(-dt) (e.g. 0.859 vs 0.162 at step 0); sigma < 1 Flow/Dance steps are unchanged.
Not run: pre-commit, pytest, Hydra config validation, and training/rollout smoke tests (reason: no local Python/torch environment). The step() refactor is numerically identical (the same formula relocated into _std_dev_t); a rollout smoke test on GPU is recommended.
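The CPS figures quoted in the test plan (0.859 vs 0.162 at step 0) can be reproduced with a few lines of pure Python. This is a rough sketch: the step size `dt = -1/28` is a hypothetical value chosen because it is consistent with the quoted numbers, not a value taken from the PR.

```python
import math

std_dev_t = 0.859      # CPS std at step 0, as quoted in the test plan
dt = -1.0 / 28.0       # ASSUMPTION: hypothetical first-step size

cps_std = std_dev_t                       # CPS: no sqrt(-dt) factor
scaled_std = std_dev_t * math.sqrt(-dt)   # what the Flow/Dance convention would give

# KL is (d_mean)^2 / (2 * std^2), so the std choice changes it by (ratio)^2.
kl_inflation = (cps_std / scaled_std) ** 2

print(round(scaled_std, 3))
print(round(kl_inflation, 1))
```

Under this assumed step size, applying the `sqrt(-dt)` factor to a CPS std would inflate the per-step KL by a factor of `1 / (-dt)`, which is why the normalizer has to follow the strategy's own convention.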

Compatibility / Risk

Low-to-moderate. Sampling/log_prob math is unchanged (each step()'s std_dev_t is the same formula moved into _std_dev_t). The behavior change is limited to the FlowDPPO KL normalizer, which is now correct per strategy (previously only Flow was handled, and Flow was wrong at sigma == 1). No config, checkpoint, data-format, or API changes. The new abstract SDEStrategy._std_dev_t is implemented by all SDE strategies (Flow/Dance/CPS); ODE/DPM2 is not an SDEStrategy and is unaffected.

Reviewer Notes

AI-assisted. Single-source-of-truth refactor so step() and the KL share one std definition. Scope is limited to sigma_t consistency; other FlowDPPO gaps (reference-KL, advantage clipping, EMA) remain out of scope. Supersedes the earlier sigma == 1-only commit on this branch (now folded in).

Checklist

I reviewed the changed code and removed unrelated/generated artifacts.
I updated tests, docs, and configs where needed, or explained why not (no test infra exists for this path; verified numerically).

FlowDPPO._compute_sigma_t used clamp(s, max=0.99) for the variance denominator, which disagreed with FlowSDEStrategy.step's where(sigma==1, sigmas[1], sigma). This underestimated the first (sigma==1) step's KL by ~3.6x, so the highest-noise step was almost never masked. Use the same sigma_max=sigmas[1] denominator so the KL-normalization sigma_t equals the transition's std_var at every step; sigma<1 steps are unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
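The ~3.6x figure in the commit message above can be sanity-checked with a short pure-Python sketch. Two things here are assumptions rather than facts from the PR: that the transition variance is inversely proportional to `1 - sigma_safe` (the commit only says the clamp sits in the variance denominator), and that `sigmas[1] = 27/28`, a hypothetical schedule value that happens to reproduce the stated ratio. The helper names are illustrative.

```python
def safe_sigma_clamp(sigma: float, max_value: float = 0.99) -> float:
    """Old behaviour described in the commit: clamp(s, max=0.99)."""
    return min(sigma, max_value)


def safe_sigma_where(sigma: float, sigma_max: float) -> float:
    """New behaviour: where(sigma == 1, sigmas[1], sigma)."""
    return sigma_max if sigma == 1.0 else sigma


def variance_factor(sigma_safe: float) -> float:
    # ASSUMPTION: variance is proportional to 1 / (1 - sigma_safe).
    return 1.0 / (1.0 - sigma_safe)


sigma_first = 1.0
sigma_second = 27.0 / 28.0   # ASSUMPTION: hypothetical value of sigmas[1]

var_old = variance_factor(safe_sigma_clamp(sigma_first))
var_new = variance_factor(safe_sigma_where(sigma_first, sigma_second))

# KL scales as 1 / variance, so an oversized variance underestimates the KL.
kl_underestimate = var_old / var_new
print(round(kl_underestimate, 1))
```

In this sketch the two helpers agree for any sigma at or below 0.99 (for example, both return 0.5 for sigma = 0.5), which matches the statement that sigma<1 steps are unchanged.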

Add SDEStrategy._std_dev_t + transition_std as the single source for the per-step transition std, and route Flow/Dance/CPS step() through them (numerically identical). FlowDPPO._compute_sigma_t now delegates to strategy.transition_std, so the KL normalizer matches each strategy: Flow/Dance use std_dev_t*sqrt(-dt); CPS uses std_dev_t (no sqrt(-dt)). Subsumes the earlier sigma==1 Flow fix (now in FlowSDEStrategy._std_dev_t).

Co-authored-by: Cursor <cursoragent@cursor.com>

Jayce-Ping and others added 2 commits June 12, 2026 10:07

Jayce-Ping changed the title ~~fix(flowdppo): align KL sigma_t with SDE transition at sigma==1~~ fix(sde,flowdppo): derive KL sigma_t from each SDE strategy (flow/dance/cps) Jun 12, 2026

Jayce-Ping requested a review from haonan3 June 12, 2026 03:59

Jayce-Ping wants to merge 2 commits into Tencent-Hunyuan:main from Jayce-Ping:fix/flowdppo-sigma-t
