Skip to content

DO-NOT-MERGE: re-validate #670 on post-flagship main#730

Closed
azchohfi wants to merge 15 commits into
mainfrom
sweep-670
Closed

DO-NOT-MERGE: re-validate #670 on post-flagship main#730
azchohfi wants to merge 15 commits into
mainfrom
sweep-670

Conversation

@azchohfi

Copy link
Copy Markdown
Collaborator

Throwaway re-validation of #670 on origin/main b9ace1e (=#692+M14+#665+#649). DO NOT MERGE; origin #670 pristine.

azchohfi and others added 15 commits June 26, 2026 21:57
YogaNode property setters (FlexDirection/Justify/Align/Width/Height/
Min/Max/margins/padding/border/position/gap/...) called
MarkDirtyAndPropagate unconditionally. FlexPanel re-applies the same
container and child style every MeasureOverride, so the root and every
cell were re-dirtied each frame and the Yoga layout cache never hit.

Add 'if (current == value) return;' guards to every setter so unchanged
values do not dirty the tree, mirroring upstream Yoga's updateStyle.
Route FlexPanel.ApplyAttachedProperties direct node.Style writes
(FlexGrow/FlexShrink/FlexBasis/AlignSelf/PositionType/Position) through
the guarded setters so a genuinely changed flex item still relayouts.

Add unit tests asserting same-value assignment does not dirty, a real
change does, and that re-applying identical style keeps the layout
cache hot (child measure function not re-invoked).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
)

YogaStyle allocated 8 separate YogaValue[] arrays per instance
(Margin/Position/Padding/Border x9, Gap x3, Dimensions/Min/Max x2).
For a ~200-node tree that is ~1600 GC objects, a dominant Yoga memory
regression vs the C++ original which uses inline members.

Replace the arrays with fixed-size InlineArray structs (EdgeValues,
GutterValues, DimensionValues) embedded directly in the YogaStyle heap
object. Exposed as ref-returning properties so existing indexer reads
and writes (style.Margin[i], style.Position[i] = v) are unchanged.
Edge-resolution helpers take ReadOnlySpan<YogaValue>.

Buffers are explicitly seeded to YogaValue.Undefined / Auto in a new
constructor because default(YogaValue) is (0, Undefined) not the
(NaN, Undefined) sentinel — keeping == and Resolve() byte-identical.
InlineArray indexing is compiler-generated and AOT/trim-safe.

All 704 Yoga-filtered tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
LayoutResults allocated 7 float[] arrays (dimensions/measured/raw x2,
position/margin/border/padding x4) plus a CachedMeasurement[8] per node.
For a ~200-node tree that is ~1600 GC objects, a dominant part of the
Yoga memory gap vs the C++ original which uses inline members.

Replace them with fixed-size InlineArray structs (Float2, Float4,
CachedMeasurementArray) embedded directly in the heap object. The public
Get/Set accessors are byte-for-byte identical; CachedMeasurements is a
ref-returning property so in-place element mutation
(CachedMeasurements[idx].AvailableWidth = ...) is unchanged.

Cached-measurement slots are still seeded via new CachedMeasurement() in
the constructor/Reset because the struct's -1 field initializers do not
run for default-initialized inline-array elements. The NaN dimension
seeds are preserved explicitly. InlineArray indexing is AOT/trim-safe.

All 704 Yoga-filtered tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ApplyAttachedProperties read ~11 attached DPs per child per MeasureOverride
via GetValue, each boxing a double/enum. Attached DPs only change through the
property system (OnChildPropertyChanged), so snapshot the values in a per-child
AttachedProps cache and invalidate on that callback and on add/remove. Width/
Height/Margin/Visibility are still re-read each pass (they change without our
callback). No layout-numeric change; 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CalculateLayoutImpl and ComputeFlexBasisForChildren each allocated a new
List<YogaNode> per container per frame (2N allocs). Rent from the existing
FlexLineHelper thread-static pool and return at the single method exit.
Both methods have a single exit and recurse into fresh rented lists, so the
LIFO pool stays balanced. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetLayoutChildren was a yield iterator that allocated a state-machine
enumerator on every call; it is used on hot per-frame paths (baseline,
trailing positions, wrap-reverse, pixel rounding). Return a value-type
LayoutChildren enumerable whose struct enumerator walks _children by index
with zero heap allocations in the common (no Display.Contents) case, falling
back to the allocating iterator only for a Contents subtree. Also iterate
RoundLayoutResultsToPixelGrid children by index instead of foreach over the
IReadOnlyList Children property (which boxed List<T>.Enumerator per node).
704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IsBaselineLayout(node) was recomputed inside JustifyMainAxis for every flex
line, plus again at STEP 8 — each call walking all layout children. Its result
is invariant within a single layout pass (depends only on the node's
FlexDirection/AlignItems and children's AlignSelf/PositionType), so compute it
once in CalculateLayoutImpl and thread it into JustifyMainAxis / STEP 8. No
cross-frame cache, so no dirty-invalidation hazard. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CalculateFlexLine allocated a new sealed FlexLine per flex line per frame.
Add a thread-static FlexLine pool; rent in CalculateFlexLine and return at the
per-line loop tail in CalculateLayoutImpl. The ItemsInFlow list travels with the
pooled FlexLine (cleared on rent) instead of being drawn from the node-list pool.
A FlexLine never escapes its loop iteration and recursion rents distinct lines,
so the LIFO pool stays balanced. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The #138 setter equality guards stopped FlexPanel from re-dirtying every child each MeasureOverride. That re-dirtying used to reset Layout.ComputedFlexBasis to NaN (via MarkDirtyAndPropagate), forcing a fresh flex-basis computation each pass.

ComputeFlexBasisForChild only recomputed a cached basis when it was NaN (or under the off-by-default WebFlexBasis feature). With the guards, a flex child first measured in a min-content probe (undefined main axis, basis measured from CONTENT) then reused that stale content-basis in the definite-width pass where an explicit flex-basis:0 must apply, breaking flex-grow distribution in nested FlexPanels (Flex selftests FlexNested_RowInCol_ContentWidth / FlexNested_Deep_L2LeftWidth).

Fix: invalidate the resolved-basis cache when ComputedFlexBasisGeneration differs from the current generation, regardless of WebFlexBasis. Each top-level CalculateLayout bumps the generation, so this reproduces the old always-dirty basis freshness without re-dirtying the root (the frame-level layout cache still hits). All 704 Yoga fixtures unchanged; Flex selftest 0 failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ReturnFlexLine pushed the FlexLine to the thread-static pool without clearing its ItemsInFlow list, so each pooled slot pinned the previous flex line's YogaNode references (and the UI subtrees they reach) between layout passes until the slot was next rented. For a PR whose goal is cutting per-node memory, that is a regression.

Move the reset (ItemsInFlow.Clear + scalar fields) into ReturnFlexLine before pushing, matching ReturnList's clear-before-pool contract, and simplify RentFlexLine to pop-or-new. The only path into the pool is ReturnFlexLine, so a popped line is always clean at rent time. Found by the pr-review skill (correctness, high; multi-model confirmed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a pure-YogaNode unit test reproducing the nested flex-grow regression the #138 setter guards exposed: a flex-basis:0 grow item measured at content during an undefined-main-axis pass must not leak that content size into a later definite-width pass. Verified to FAIL (217 vs 200) without the per-generation basis invalidation in ComputeFlexBasisForChild and pass with it. The 704 single-generation fixtures never exercised this multi-generation path; the Flex selftest that originally caught it is outside the Yoga-only edit scope.

Found by the pr-review skill (test-coverage, medium).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…en enumerator (#145)

Copilot review flagged that the inner IEnumerator<YogaNode> used to flatten a Display.Contents subtree was dropped without Dispose() at natural completion, retaining the iterator's captured subtree references until the next GC. Dispose _inner at both completion points and implement IDisposable on the struct enumerator so a foreach that breaks out early mid-subtree (e.g. baseline detection in AlgorithmUtils) also releases it via the finally-Dispose. foreach calls Dispose on a struct enumerator that implements IDisposable as a non-boxing constrained call, so the common no-inner path stays allocation-free and AOT-safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The #147 attached-DP read cache is invalidated via OnChildPropertyChanged
(while the child is parented) and the SyncYogaTree removal sweep. Neither
fires when an attached DP changes while the child is detached from its panel
and the child is removed+re-added before the next measure: OnChildPropertyChanged
cannot reach the owning panel to drop the entry, and the removal sweep skips a
child that is present again. ApplyAttachedProperties then trusted a stale snapshot.

Flag such elements in a static weak-keyed ConditionalWeakTable from the detached
branch of OnChildPropertyChanged; ApplyAttachedProperties consults it only on a
cache hit and forces a re-read (clearing the marker). Steady-state cost is one
empty-CWT probe per cached child per sync, preserving the #147 win.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three Yoga unit tests surfaced by the pre-GO review gate:
- Setter_RealChange_MarksDirty now also covers aspectRatio (#138 real-change arm).
- Setter_AspectRatio_DegenerateNormalizesToAuto locks in the 0/Infinity->NaN
  normalization: re-applying a degenerate ratio is a no-op (NaN==NaN guard, so the
  layout cache is not defeated), while a genuine ratio change still dirties.
- FlexLineHelper_RentAndReturnFlexLine_ReusesAndResets asserts the #144 pooled
  FlexLine is reused AND fully reset on return (no stale scalars or pinned YogaNode
  refs leak across layout passes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The setter-guard and flex-basis-generation test sections (added earlier on this
branch) restarted the file's section numbering at 6/7 after section 12. Continue
the sequence as 13/14 so the section headings stay unique and navigable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azchohfi

Copy link
Copy Markdown
Collaborator Author

/perf

{
// Degenerate aspect ratios act as auto
_style.AspectRatio = (value == 0 || float.IsInfinity(value)) ? float.NaN : value;
float normalized = (value == 0 || float.IsInfinity(value)) ? float.NaN : value;
@github-actions

Copy link
Copy Markdown

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs main baseline

Metric main (baseline) This PR Δ (95% CI) Status
Renders/sec ↑ 3.65 3.67 +1.2% 95% CI [-2.6, +5.0] ≈ within noise
Avg Reconcile (ms) ↓ 101.4 101.3 -0.8% 95% CI [-3.8, +2.2] ≈ within noise
Avg Diff (ms) ↓ 90.4 90.4 -1.3% 95% CI [-4.2, +1.6] ≈ within noise
Avg Memory (MB) ↓ 289.4 289.7 -0.1% 95% CI [-0.7, +0.6] ≈ within noise

Low-mutation skip-floor (--percent 0)

At --percent 0 the workload mutates few cells per tick (always at least one), so reconcile/diff isolate the O(n) per-tick child skip-walk floor that higher mutation rates dilute — ChildReconciler re-walks every child each tick even when nothing moved. The closer --percent is to 0, the more this floor is the signal, so a structural-skip optimization shows up cleanly where the headline table above buries it. Δ is the mean paired change with a 95% CI.

Metric main (baseline) This PR Δ (95% CI) Status
Renders/sec ↑ 16.55 16.15 -1.1% 95% CI [-6.3, +4.1] ≈ within noise
Avg Reconcile (ms) ↓ 31.8 29.9 -3.8% 95% CI [-10.1, +2.6] ≈ within noise
Avg Diff (ms) ↓ 29.8 28.1 -3.9% 95% CI [-10.6, +2.9] ≈ within noise
Avg Memory (MB) ↓ 268.8 269.4 +0.2% 95% CI [-0.1, +0.5] ≈ within noise

Allocation (Reactor) — lower is better

Metric main (baseline) This PR Δ (95% CI) Status
Alloc bytes/render ↓ 5699870 5713848 -0.1% 95% CI [-0.6, +0.4] ≈ within noise
Gen0 GC / 1k renders ↓ 223.61 223.61 +0.1% 95% CI [-3.8, +3.9] ≈ within noise

Keyed-list workload (StressPerf.KeyedList, --percent 50)

A separate macro workload: a ~500-row stably keyed list whose rows are reordered / inserted / removed each tick. Because every child carries a key, the child reconciler takes its keyed arm (ReconcileKeyedReconcileKeyedMiddle, the LIS-based minimal-move pass) instead of the positional re-walk the StocksGrid tables above measure — so this is the sensitive macro signal for keyed-diff work the positional cells can never reach. Same interleaved paired-Δ 95% CI as the headline table.

Metric main (baseline) This PR Δ (95% CI) Status
Renders/sec ↑ 15.77 15.70 +0.9% 95% CI [-1.1, +2.9] ≈ within noise
Avg Reconcile (ms) ↓ 9.2 9.3 +1.0% 95% CI [-1.4, +3.5] ≈ within noise
Avg Diff (ms) ↓ 9.0 9.1 +0.8% 95% CI [-1.4, +3.1] ≈ within noise
Avg Memory (MB) ↓ 166.1 165.9 -0.1% 95% CI [-0.3, +0.1] ≈ within noise

Allocation (keyed-list) — lower is better

Metric main (baseline) This PR Δ (95% CI) Status
Alloc bytes/render ↓ 314092 314406 +0.1% 95% CI [-0.2, +0.4] ≈ within noise
Gen0 GC / 1k renders ↓ 18.63 18.87 +3.4% 95% CI [-9.5, +16.3] ≈ within noise

Reconciler micro-benchmarks (PerfBench.ControlModel)

Production --variant Reactor control-model path, ns-resolution and WinUI-undiluted (spec-047 M1–M13) — ↓ lower is better. Status tracks allocated bytes/op, the authoritative signal here; it is deterministic for structurally-fixed benches, while dispatcher / background-thread benches carry a small process-to-process offset, so a bench is flagged only when its 95% CI clears a ±3% minimum-effect band (real structural alloc changes are several percent to many-x). ns/op is shown for context but is not auto-flagged (its paired CI is rep-interleaved but the flag remains dormant pending a real-CI identical-binary band calibration). Δ is the mean paired change with a 95% CI.

Bench main ns/op Δ ns (95% CI) main B/op Δ alloc (95% CI) Status
M1 Mount_Leaf_NoCallback 93101.9 +2.3% 95% CI [-0.4, +5.0] 1140.9 0.0% 95% CI [0.0, 0.0] ≈ within noise
M2 Mount_Leaf_OneCallback 67702.3 0.0% 95% CI [-1.5, +1.5] 3383.3 0.0% 95% CI [0.0, 0.0] ≈ within noise
M3 Mount_Leaf_ThreeCallbacks 154486.9 -0.2% 95% CI [-3.7, +3.2] 8760.5 +0.1% 95% CI [-0.2, +0.3] ≈ within noise
M4 Dispatch_Switch_Cold 70896.7 -2.0% 95% CI [-4.2, +0.1] 1767.8 0.0% 95% CI [0.0, 0.0] ≈ within noise
M5 Dispatch_Switch_Warm 69693.1 -3.1% 95% CI [-6.5, +0.3] 1845.7 +0.8% 95% CI [-0.4, +1.9] ≈ within noise
M6 Dispatch_ExternalType 61927.7 0.0% 95% CI [-0.5, +0.5] 1069.6 +0.7% 95% CI [-1.9, +3.4] ≈ within noise
M7 Update_NoChange 37803.0 +0.1% 95% CI [-0.2, +0.4] 370.1 -1.2% 95% CI [-7.6, +5.3] ≈ within noise
M8 Update_OneLeafChanged 26948.0 -0.1% 95% CI [-1.0, +0.9] 536.0 0.0% 95% CI [0.0, 0.0] ≈ within noise
M9 Update_AllChanged 2495399.8 +0.5% 95% CI [-0.4, +1.4] 184278.1 0.0% 95% CI [0.0, 0.0] ≈ within noise
M10 EventHandlerState_Alloc 47602.1 +0.7% 95% CI [-1.5, +2.9] 3101.8 +0.2% 95% CI [-0.6, +1.1] ≈ within noise
M11 ModifierEHS_Frequency 30971.6 +1.9% 95% CI [+0.5, +3.4] 640.5 -0.1% 95% CI [-0.5, +0.4] ≈ within noise
M12 Pool_Rent_HotPath 74600.6 -1.5% 95% CI [-3.3, +0.2] 1099.9 0.0% 95% CI [0.0, 0.0] ≈ within noise
M13 Setters_Suppression_Scope 93.8 +8.9% 95% CI [-1.1, +18.9] 26.7 0.0% 95% CI [0.0, 0.0] ≈ within noise
M14 Dsl_Rebuild_Cascade 1279208.4 +0.5% 95% CI [-1.6, +2.7] 2231828.9 0.0% 95% CI [0.0, 0.0] ≈ within noise
C207 ChangeHandler_DpRead_Coalesce 898.7 -0.1% 95% CI [-6.1, +6.0] 0.6 0.0% 95% CI [0.0, 0.0] ≈ within noise
OAlloc Optional_Element_Alloc 206.2 +2.3% 95% CI [-3.3, +7.9] 528.0 0.0% 95% CI [0.0, 0.0] ≈ within noise
OUpdate Optional_Reconciler_Update 9522.6 +0.1% 95% CI [-1.0, +1.1] 2772.3 0.0% 95% CI [0.0, 0.0] ≈ within noise

Cross-framework reference (same StocksGrid workload)

Metric vanilla WinUI3¹ Rust windows-reactor² Reactor (this PR)
Renders/sec ↑ 4.58 6.34 3.67
Avg Reconcile (ms) ↓ n/a 18.9 101.3
Avg Diff (ms) ↓ n/a 17.1 90.4
Avg Memory (MB) ↓ 262.8 193.0 289.7

↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
Reconciler micro-benchmarks run PerfBench.ControlModel --variant Reactor (M1–M13) as a headless loop bracketed by per-thread alloc + GC counters — ns-resolution and free of WinUI render / working-set dilution, so they resolve Core/Reconciler allocation deltas the macro StocksGrid workload cannot. main and PR each link their own src/Reactor build and are rep-interleaved (a fresh alternated process per rep); Δ is the paired 95% CI over per-rep means. The Status column tracks allocated bytes/op (deterministic for identical code); ns/op is informational — its paired CI is now unbiased but the flag stays dormant pending a real-CI identical-binary band calibration.
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: AMD EPYC 9V74 80-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1043000781.
Generated by .github/workflows/perf-compare.yml · PR e4194f4 vs main e8572a0 · 2026-06-27T05:35:15Z · run log.

@azchohfi azchohfi closed this Jun 27, 2026
@azchohfi azchohfi deleted the sweep-670 branch June 27, 2026 05:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant