
perf: Yoga layout-cache guards + inline per-node arrays#670

Draft
azchohfi wants to merge 15 commits into main from azchohfi-yoga-layout-memory-perf

@azchohfi azchohfi commented Jun 25, 2026

Collaborator

Closes #658

Summary

Targets the Yoga layout engine (src/Reactor/Yoga/ only) — implicated in both the Renders/sec gap (layout cache never hit → full re-layout every frame) and the ~1.5x Memory gap (per-node arrays where the C++ original uses inline members) on the StocksGrid data-grid stress workload.

Fixes

Flagship — renders (layout cache):

  • #138 YogaNode setter equality guards (if (current == value) return;) on FlexDirection/Justify/Align/Width/Height/Min/margins/… so that SetRootConstraints and ApplyAttachedProperties, which re-set unchanged values each frame, no longer re-dirty the root. This is what lets the frame-level layout cache actually hit for stable cells. Unit tests assert that no-op sets don't dirty and real changes do (including the AspectRatio degenerate 0/Infinity → auto normalization).

Flagship — memory:

Supporting allocations / boxing:

Regression fix (caught by Flex selftest):

  • The #138 guards stopped FlexPanel from re-dirtying every child on each MeasureOverride; that re-dirtying used to reset Layout.ComputedFlexBasis to NaN. ComputeFlexBasisForChild only recomputed a cached basis when it was NaN (or under the off-by-default WebFlexBasis flag), so a child first measured in a min-content probe (undefined main axis → basis from content) reused that stale content basis in the later definite-width pass, where flex-basis:0 must apply. That broke flex-grow distribution in nested FlexPanels. Fix: invalidate the resolved-basis cache when ComputedFlexBasisGeneration differs from the current generation, regardless of WebFlexBasis. Each top-level CalculateLayout bumps the generation, so this restores the old always-fresh basis without re-dirtying the root (the layout cache still hits).

Deferred (intentionally, for correctness)

Validation

  • Yoga fixtures (--filter ~Yoga): 708 passed, 0 failed, 46 skipped — encode Yoga's reference behavior; all preserved.
  • Full tests/Reactor.Tests: 9684 passed, 0 failed, 64 skipped.
  • Flex selftest (--self-test --filter Flex): 0 failures, including the two nested flex-grow fixtures the regression fix restores.
  • dotnet build src/Reactor/Reactor.csproj -c Release: core Reactor.dll builds with 0 warnings / 0 errors (the core lib treats AOT/trim warnings as errors, so the InlineArray/Unsafe/ConditionalWeakTable usage is confirmed AOT-safe).

Layout correctness is paramount — no fixture numeric results changed.

azchohfi and others added 9 commits June 25, 2026 12:16
YogaNode property setters (FlexDirection/Justify/Align/Width/Height/
Min/Max/margins/padding/border/position/gap/...) called
MarkDirtyAndPropagate unconditionally. FlexPanel re-applies the same
container and child style every MeasureOverride, so the root and every
cell were re-dirtied each frame and the Yoga layout cache never hit.

Add 'if (current == value) return;' guards to every setter so unchanged
values do not dirty the tree, mirroring upstream Yoga's updateStyle.
Route FlexPanel.ApplyAttachedProperties direct node.Style writes
(FlexGrow/FlexShrink/FlexBasis/AlignSelf/PositionType/Position) through
the guarded setters so a genuinely changed flex item still relayouts.

Add unit tests asserting same-value assignment does not dirty, a real
change does, and that re-applying identical style keeps the layout
cache hot (child measure function not re-invoked).
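
A minimal sketch of the guard pattern described above, for illustration only. GuardedNode is a hypothetical stand-in for YogaNode, and the use of float.Equals (which treats NaN as equal to NaN, unlike ==) is an assumption about how an "undefined" value could be compared, not a statement of what the real setters do.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for YogaNode: one style value plus a dirty flag.
public sealed class GuardedNode
{
    private float _width = float.NaN;   // NaN stands for "undefined" in this sketch

    public bool IsDirty { get; private set; }
    public GuardedNode? Parent { get; set; }

    public float Width
    {
        get => _width;
        set
        {
            // Equality guard: re-applying the current value must not dirty the tree.
            // float.Equals(NaN, NaN) is true, so re-applying "undefined" is a no-op too.
            if (_width.Equals(value)) return;
            _width = value;
            MarkDirtyAndPropagate();
        }
    }

    // Called once a layout pass has consumed the dirty state.
    public void ClearDirty() => IsDirty = false;

    // Marks this node and its ancestors dirty, stopping at the first already-dirty one.
    private void MarkDirtyAndPropagate()
    {
        for (GuardedNode? n = this; n is not null && !n.IsDirty; n = n.Parent)
            n.IsDirty = true;
    }
}
```

With this shape, assigning the same Width twice dirties the node and its parent only on the first assignment, which is the property the unit tests above check for.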

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

YogaStyle allocated 8 separate YogaValue[] arrays per instance
(Margin/Position/Padding/Border x9, Gap x3, Dimensions/Min/Max x2).
For a ~200-node tree that is ~1600 GC objects, a dominant Yoga memory
regression vs the C++ original which uses inline members.

Replace the arrays with fixed-size InlineArray structs (EdgeValues,
GutterValues, DimensionValues) embedded directly in the YogaStyle heap
object. Exposed as ref-returning properties so existing indexer reads
and writes (style.Margin[i], style.Position[i] = v) are unchanged.
Edge-resolution helpers take ReadOnlySpan<YogaValue>.

Buffers are explicitly seeded to YogaValue.Undefined / Auto in a new
constructor because default(YogaValue) is (0, Undefined) not the
(NaN, Undefined) sentinel — keeping == and Resolve() byte-identical.
InlineArray indexing is compiler-generated and AOT/trim-safe.

All 704 Yoga-filtered tests pass unchanged.
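
The inline-buffer approach can be sketched as follows. This is an illustrative reduction, not the PR's code: EdgeFloats and StyleSketch are hypothetical names, plain floats replace YogaValue, and the sketch assumes C# 12 on .NET 8 or later, where [InlineArray] is available.

```csharp
using System;
using System.Runtime.CompilerServices;

// Nine floats stored inline in the owning object instead of in a separate array.
[InlineArray(9)]
public struct EdgeFloats
{
    private float _element0;
}

// Hypothetical stand-in for YogaStyle.
public sealed class StyleSketch
{
    private EdgeFloats _margin;

    public StyleSketch()
    {
        // default(float) is 0, so the "undefined" sentinel (NaN here) is seeded explicitly.
        for (int i = 0; i < 9; i++) _margin[i] = float.NaN;
    }

    // Ref-returning property: style.Margin[i] = v writes into the field itself, not a copy.
    public ref EdgeFloats Margin => ref _margin;

    // Helpers can accept a span view of the buffer rather than an array.
    public static int CountDefined(ReadOnlySpan<float> edges)
    {
        int defined = 0;
        foreach (float e in edges)
            if (!float.IsNaN(e)) defined++;
        return defined;
    }
}
```

Call sites keep the array-like syntax (style.Margin[2] = 8f), while the nine values live inside the StyleSketch object, so no extra array object is allocated per style.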

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
LayoutResults allocated 7 float[] arrays (dimensions/measured/raw x2,
position/margin/border/padding x4) plus a CachedMeasurement[8] per node.
For a ~200-node tree that is ~1600 GC objects, a dominant part of the
Yoga memory gap vs the C++ original which uses inline members.

Replace them with fixed-size InlineArray structs (Float2, Float4,
CachedMeasurementArray) embedded directly in the heap object. The public
Get/Set accessors are byte-for-byte identical; CachedMeasurements is a
ref-returning property so in-place element mutation
(CachedMeasurements[idx].AvailableWidth = ...) is unchanged.

Cached-measurement slots are still seeded via new CachedMeasurement() in
the constructor/Reset because the struct's -1 field initializers do not
run for default-initialized inline-array elements. The NaN dimension
seeds are preserved explicitly. InlineArray indexing is AOT/trim-safe.

All 704 Yoga-filtered tests pass unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ApplyAttachedProperties read ~11 attached DPs per child per MeasureOverride
via GetValue, each boxing a double/enum. Attached DPs only change through the
property system (OnChildPropertyChanged), so snapshot the values in a per-child
AttachedProps cache and invalidate on that callback and on add/remove. Width/
Height/Margin/Visibility are still re-read each pass (they change without our
callback). No layout-numeric change; 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CalculateLayoutImpl and ComputeFlexBasisForChildren each allocated a new
List<YogaNode> per container per frame (2N allocs). Rent from the existing
FlexLineHelper thread-static pool and return at the single method exit.
Both methods have a single exit and recurse into fresh rented lists, so the
LIFO pool stays balanced. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GetLayoutChildren was a yield iterator that allocated a state-machine
enumerator on every call; it is used on hot per-frame paths (baseline,
trailing positions, wrap-reverse, pixel rounding). Return a value-type
LayoutChildren enumerable whose struct enumerator walks _children by index
with zero heap allocations in the common (no Display.Contents) case, falling
back to the allocating iterator only for a Contents subtree. Also iterate
RoundLayoutResultsToPixelGrid children by index instead of foreach over the
IReadOnlyList Children property (which boxed List<T>.Enumerator per node).
704 Yoga fixtures pass.
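
As a rough illustration of the value-type enumerable idea, the sketch below walks a list by index through a struct enumerator. IndexedChildren<T> is a hypothetical name, and the sketch omits the Display.Contents fallback entirely.

```csharp
using System.Collections.Generic;

// Value-type enumerable over a List<T>; hypothetical simplification of LayoutChildren.
public readonly struct IndexedChildren<T>
{
    private readonly List<T> _items;

    public IndexedChildren(List<T> items) => _items = items;

    // foreach binds to this method by pattern, so no interface dispatch
    // and no boxing of the enumerator is involved.
    public Enumerator GetEnumerator() => new Enumerator(_items);

    public struct Enumerator
    {
        private readonly List<T> _items;
        private int _index;

        internal Enumerator(List<T> items)
        {
            _items = items;
            _index = -1;
        }

        public T Current => _items[_index];

        // Advances by index; the enumerator lives on the caller's stack.
        public bool MoveNext() => ++_index < _items.Count;
    }
}
```

A loop such as foreach (var child in new IndexedChildren<YogaNode>(children)) then compiles down to direct MoveNext/Current calls on the struct, in contrast to a yield iterator, which allocates a state-machine object per call.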

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IsBaselineLayout(node) was recomputed inside JustifyMainAxis for every flex
line, plus again at STEP 8 — each call walking all layout children. Its result
is invariant within a single layout pass (depends only on the node's
FlexDirection/AlignItems and children's AlignSelf/PositionType), so compute it
once in CalculateLayoutImpl and thread it into JustifyMainAxis / STEP 8. No
cross-frame cache, so no dirty-invalidation hazard. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CalculateFlexLine allocated a new sealed FlexLine per flex line per frame.
Add a thread-static FlexLine pool; rent in CalculateFlexLine and return at the
per-line loop tail in CalculateLayoutImpl. The ItemsInFlow list travels with the
pooled FlexLine (cleared on rent) instead of being drawn from the node-list pool.
A FlexLine never escapes its loop iteration and recursion rents distinct lines,
so the LIFO pool stays balanced. 704 Yoga fixtures pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The #138 setter equality guards stopped FlexPanel from re-dirtying every child each MeasureOverride. That re-dirtying used to reset Layout.ComputedFlexBasis to NaN (via MarkDirtyAndPropagate), forcing a fresh flex-basis computation each pass.

ComputeFlexBasisForChild only recomputed a cached basis when it was NaN (or under the off-by-default WebFlexBasis feature). With the guards, a flex child first measured in a min-content probe (undefined main axis, basis measured from CONTENT) then reused that stale content-basis in the definite-width pass where an explicit flex-basis:0 must apply, breaking flex-grow distribution in nested FlexPanels (Flex selftests FlexNested_RowInCol_ContentWidth / FlexNested_Deep_L2LeftWidth).

Fix: invalidate the resolved-basis cache when ComputedFlexBasisGeneration differs from the current generation, regardless of WebFlexBasis. Each top-level CalculateLayout bumps the generation, so this reproduces the old always-dirty basis freshness without re-dirtying the root (the frame-level layout cache still hits). All 704 Yoga fixtures unchanged; Flex selftest 0 failures.
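
The generation check can be pictured with the sketch below. BasisCache and its members are hypothetical; the real logic lives inside ComputeFlexBasisForChild and works on LayoutResults fields, so treat this only as a model of the invalidation rule.

```csharp
using System;

// Hypothetical model of a per-node resolved-basis cache stamped with a layout generation.
public sealed class BasisCache
{
    private float _basis = float.NaN;   // NaN means "not computed yet"
    private uint _generation;

    public int ComputeCount { get; private set; }

    public float Resolve(uint currentGeneration, Func<float> compute)
    {
        // Recompute when nothing is cached OR the cached value belongs to an
        // earlier top-level layout pass, even if the node was never re-dirtied.
        if (float.IsNaN(_basis) || _generation != currentGeneration)
        {
            _basis = compute();
            _generation = currentGeneration;
            ComputeCount++;
        }
        return _basis;
    }
}
```

Within one generation the cached value is reused; once the generation counter moves on, the next Resolve call recomputes, which models how a content-derived basis from a probe pass stops leaking into a later definite-width pass.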

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azchohfi azchohfi marked this pull request as draft June 25, 2026 20:57
Comment thread src/Reactor/Yoga/YogaNode.cs
azchohfi and others added 2 commits June 25, 2026 14:27
ReturnFlexLine pushed the FlexLine to the thread-static pool without clearing its ItemsInFlow list, so each pooled slot pinned the previous flex line's YogaNode references (and the UI subtrees they reach) between layout passes until the slot was next rented. For a PR whose goal is cutting per-node memory, that is a regression.

Move the reset (ItemsInFlow.Clear + scalar fields) into ReturnFlexLine before pushing, matching ReturnList's clear-before-pool contract, and simplify RentFlexLine to pop-or-new. The only path into the pool is ReturnFlexLine, so a popped line is always clean at rent time. Found by the pr-review skill (correctness, high; multi-model confirmed).
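
The clear-before-pool contract can be sketched like this. PooledLine and LinePool are hypothetical names standing in for FlexLine and the FlexLineHelper pool, and the field set is reduced to one list and one scalar.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-in for FlexLine.
public sealed class PooledLine
{
    public readonly List<object> Items = new List<object>();
    public float MainSize;
}

public static class LinePool
{
    // One pool per thread, so no locking is needed.
    [ThreadStatic] private static Stack<PooledLine>? t_pool;

    // Pop-or-new: anything in the pool was already reset by Return.
    public static PooledLine Rent()
    {
        var pool = t_pool;
        return pool is not null && pool.Count > 0 ? pool.Pop() : new PooledLine();
    }

    // Reset BEFORE pushing, so an idle pooled line holds no node references.
    public static void Return(PooledLine line)
    {
        line.Items.Clear();
        line.MainSize = 0f;
        (t_pool ??= new Stack<PooledLine>()).Push(line);
    }
}
```

Because Return is the only way into the pool, Rent does not need to clear anything, and an idle slot cannot keep a previous pass's items reachable.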

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds a pure-YogaNode unit test reproducing the nested flex-grow regression the #138 setter guards exposed: a flex-basis:0 grow item measured at content during an undefined-main-axis pass must not leak that content size into a later definite-width pass. Verified to FAIL (217 vs 200) without the per-generation basis invalidation in ComputeFlexBasisForChild and pass with it. The 704 single-generation fixtures never exercised this multi-generation path; the Flex selftest that originally caught it is outside the Yoga-only edit scope.

Found by the pr-review skill (test-coverage, medium).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR targets the Yoga layout engine (src/Reactor/Yoga/) to improve frame-to-frame layout-cache hit rates (by preventing no-op style re-dirtying) and reduce per-node memory/GC pressure (by replacing per-node arrays with [InlineArray] inline buffers), with additional hot-path allocation reductions via pooling and iteration changes.

Changes:

  • Add equality guards to YogaNode style setters so re-applying identical style does not mark nodes dirty, enabling effective layout caching; add unit tests covering no-op sets vs real changes.
  • Replace multiple per-node heap arrays in YogaStyle and LayoutResults with inline [InlineArray] buffers while keeping accessor semantics consistent.
  • Reduce per-layout allocations by pooling child lists / FlexLine, and by switching hot traversals away from allocating enumerators.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/Reactor.Tests/YogaEdgeCaseTests.cs | Adds tests asserting no-op setter writes don’t dirty nodes and that layout cache stays hot across repeated layouts. |
| src/Reactor/Yoga/YogaStyle.cs | Replaces edge/gutter/dimension YogaValue[] arrays with inline buffers and updates edge computation helpers. |
| src/Reactor/Yoga/YogaNode.cs | Adds equality-guarded setters and introduces an allocation-free GetLayoutChildren() enumerator path. |
| src/Reactor/Yoga/YogaAlgorithm.cs | Uses pooled lists for layout child collection and computes baseline-layout flag once per layout pass; fixes flex-basis cache invalidation across generations. |
| src/Reactor/Yoga/LayoutResults.cs | Replaces multiple float arrays + cached-measurement array with inline buffers; updates reset/initialization accordingly. |
| src/Reactor/Yoga/FlexPanel.cs | Adds per-child attached-DP caching and routes updates through YogaNode setters to avoid bypassing dirty tracking. |
| src/Reactor/Yoga/AlgorithmUtils.cs | Avoids per-node enumerator boxing in pixel-rounding traversal and adds FlexLine pooling support. |


Comment thread src/Reactor/Yoga/YogaNode.cs
Comment thread src/Reactor/Yoga/YogaNode.cs
azchohfi and others added 3 commits June 25, 2026 14:46
…en enumerator (#145)

Copilot review flagged that the inner IEnumerator<YogaNode> used to flatten a Display.Contents subtree was dropped without Dispose() at natural completion, retaining the iterator's captured subtree references until the next GC. Dispose _inner at both completion points and implement IDisposable on the struct enumerator so a foreach that breaks out early mid-subtree (e.g. baseline detection in AlgorithmUtils) also releases it via the finally-Dispose. foreach calls Dispose on a struct enumerator that implements IDisposable as a non-boxing constrained call, so the common no-inner path stays allocation-free and AOT-safe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The #147 attached-DP read cache is invalidated via OnChildPropertyChanged
(while the child is parented) and the SyncYogaTree removal sweep. Neither
fires when an attached DP changes while the child is detached from its panel
and the child is removed+re-added before the next measure: OnChildPropertyChanged
cannot reach the owning panel to drop the entry, and the removal sweep skips a
child that is present again. ApplyAttachedProperties then trusted a stale snapshot.

Flag such elements in a static weak-keyed ConditionalWeakTable from the detached
branch of OnChildPropertyChanged; ApplyAttachedProperties consults it only on a
cache hit and forces a re-read (clearing the marker). Steady-state cost is one
empty-CWT probe per cached child per sync, preserving the #147 win.
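
A small sketch of the weak-keyed marker idea, under stated assumptions: StaleMarkers and its method names are hypothetical, plain object keys stand in for UI elements, and the surrounding cache is left out.

```csharp
#nullable enable
using System.Runtime.CompilerServices;

// Hypothetical weak-keyed "this element's snapshot is stale" marker set.
public static class StaleMarkers
{
    private static readonly object s_marker = new object();

    // Keys are held weakly: an entry does not keep its element alive.
    private static readonly ConditionalWeakTable<object, object> s_stale = new();

    // Called when a property changes while the element has no owning panel.
    public static void MarkStale(object element) => s_stale.AddOrUpdate(element, s_marker);

    // Called on a cache hit: returns true (and clears the marker) if a re-read is needed.
    public static bool ConsumeStale(object element) => s_stale.Remove(element);
}
```

In the steady state no element is marked, so ConsumeStale is a lookup that finds nothing and the cached snapshot is used as-is.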

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add three Yoga unit tests surfaced by the pre-GO review gate:
- Setter_RealChange_MarksDirty now also covers aspectRatio (#138 real-change arm).
- Setter_AspectRatio_DegenerateNormalizesToAuto locks in the 0/Infinity->NaN
  normalization: re-applying a degenerate ratio is a no-op (NaN==NaN guard, so the
  layout cache is not defeated), while a genuine ratio change still dirties.
- FlexLineHelper_RentAndReturnFlexLine_ReusesAndResets asserts the #144 pooled
  FlexLine is reused AND fully reset on return (no stale scalars or pinned YogaNode
  refs leak across layout passes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment thread tests/Reactor.Tests/YogaEdgeCaseTests.cs
Comment thread tests/Reactor.Tests/YogaEdgeCaseTests.cs
The setter-guard and flex-basis-generation test sections (added earlier on this
branch) restarted the file's section numbering at 6/7 after section 12. Continue
the sequence as 13/14 so the section headings stay unique and navigable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Comment thread src/Reactor/Yoga/FlexPanel.cs
@azchohfi

Collaborator Author

/perf

@github-actions

⚡ Reactor perf comparison

Workload: StressPerf.ReactorOptimized StocksGrid · --percent 50 --duration 10 · x64 Release · median of 12 paired runs (2 warmup dropped); Δ is the mean change with a 95% CI · PR head and main built and run interleaved on the same runner.

Regression vs main baseline

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Renders/sec ↑ | 2.62 | 2.48 | -3.6% 95% CI [-8.0, +0.9] | ≈ within noise |
| Avg Reconcile (ms) ↓ | 135.9 | 145.4 | +4.4% 95% CI [-0.2, +8.9] | ≈ within noise |
| Avg Diff (ms) ↓ | 125.3 | 134.8 | +4.8% 95% CI [-0.1, +9.6] | ≈ within noise |
| Avg Memory (MB) ↓ | 295.3 | 289.1 | -1.6% 95% CI [-2.8, -0.5] | ✅ improvement |

Allocation (Reactor) — lower is better

| Metric | main (baseline) | This PR | Δ (95% CI) | Status |
| --- | --- | --- | --- | --- |
| Alloc bytes/render ↓ | 9542343 | n/a | | |
| Gen0 GC / 1k renders ↓ | 269.23 | n/a | | |

Cross-framework reference (same StocksGrid workload)

| Metric | vanilla WinUI3¹ | Rust windows-reactor² | Reactor (this PR) |
| --- | --- | --- | --- |
| Renders/sec ↑ | 3.24 | 4.95 | 2.48 |
| Avg Reconcile (ms) ↓ | n/a | 18.8 | 145.4 |
| Avg Diff (ms) ↓ | n/a | 17.0 | 134.8 |
| Avg Memory (MB) ↓ | 264.5 | 197.4 | 289.1 |

↑ higher is better · ↓ lower is better. Within noise = the 95% confidence interval of the paired Δ includes 0 (no change resolvable at this sample size); ✅ improvement / ⚠️ regression require the CI to exclude 0.
Allocation metrics (alloc bytes/render, Gen0 GC) are the sensitive signal for allocation-reduction work, where the mean-ms / memory figures are largely flat. They read n/a for a harness built from a revision that predates them (rebase the PR onto main to populate them).
¹ vanilla WinUI3 = StressPerf.Direct (imperative; no virtual-DOM, so it has no reconcile/diff phase — those cells read n/a). Measured live on this runner.
² Rust = test_reactor_perf from microsoft/windows-rs — a port of this harness (same StocksGrid, same --percent/--duration CLI). Built from source and measured live on this runner.
Absolute numbers are runner-dependent — trust the Δ vs main, not the absolute values. Memory (working set) is the noisiest metric.
Runner: CPU: AMD EPYC 7763 64-Core Processor · 4 logical cores · 16 GB RAM · runner: GitHub Actions 1042818410.
Generated by .github/workflows/perf-compare.yml · PR 1323999 vs main 66d38dc · 2026-06-26T06:57:45Z · run log.

