[NO-REVIEW] [NO-MERGE] Auto loop vectorization experiment #127853
hez2010 wants to merge 35 commits into dotnet:main from
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces a new JIT auto-vectorization optimization pass that analyzes and rewrites qualifying loops into SIMD vector loops (with scalar epilogues), along with associated config knobs, phase plumbing, build integration, and perf metrics.
Changes:
- Add `AutoVectorizer` implementation and integrate it as a new compilation phase.
- Introduce new JIT config flags to control auto-vectorization and an "aggressive" mode.
- Add a new JIT metadata metric to track the number of loops vectorized.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/coreclr/jit/jitmetadatalist.h | Adds a new LoopsVectorized metric to track vectorized loops. |
| src/coreclr/jit/jitconfigvalues.h | Adds config switches to enable/disable auto-vectorization and aggressive vectorizing. |
| src/coreclr/jit/compphases.h | Registers a new PHASE_AUTO_VECTORIZATION phase name. |
| src/coreclr/jit/compiler.h | Adds optAutoVectorize() and grants the new pass friend access. |
| src/coreclr/jit/compiler.cpp | Wires the new phase into the pipeline when optimizations are enabled. |
| src/coreclr/jit/autovectorizer.h | Declares the AutoVectorizer pass and its planning/rewriting machinery. |
| src/coreclr/jit/autovectorizer.cpp | Implements loop analysis, SLP planning, profitability heuristics, and CFG/IR rewrite. |
| src/coreclr/jit/CMakeLists.txt | Adds the new source/header to the JIT build. |
```cpp
GenTree* AutoVectorizer::BuildVectorReductionOp(LoopVectorizationPlan* plan,
                                                const LoopVectorizationPlan::ReductionInfo& reduction,
                                                GenTree* op1,
                                                GenTree* op2)
{
#if defined(FEATURE_HW_INTRINSICS) && (defined(TARGET_XARCH) || defined(TARGET_ARM64))
    const var_types simdType = Compiler::getSIMDTypeForSize(plan->VectorSizeBytes);
    if (reduction.Oper != GT_INTRINSIC)
    {
        return m_compiler->gtNewSimdBinOpNode(GT_ADD, simdType, op1, op2, plan->ElementType, plan->VectorSizeBytes);
    }

    return BuildVectorMinMaxOp(reduction, op1, op2, simdType, plan->VectorSizeBytes);
#else
    unreached();
#endif
}
```
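The hunk above only dispatches a reduction's combining op; as a rough standalone model of what an add reduction ends up computing, here is a sketch that lane-wise accumulates into a 4-lane "vector" and then horizontally sums the lanes into the scalar result. All names here are illustrative, not RyuJIT API.

```cpp
#include <array>
#include <cstddef>

// Illustrative model: a 4-lane integer "vector" plus the scalar
// finalization step an add reduction needs after the vector loop.
using Vec4 = std::array<int, 4>;

// Vector-loop body update: lane-wise accumulate (what a SIMD GT_ADD does).
inline Vec4 VecAdd(const Vec4& a, const Vec4& b)
{
    Vec4 r{};
    for (size_t i = 0; i < 4; i++)
        r[i] = a[i] + b[i];
    return r;
}

// Scalar finalization: horizontal sum of the accumulator's lanes.
inline int HorizontalAdd(const Vec4& v)
{
    int sum = 0;
    for (int lane : v)
        sum += lane;
    return sum;
}
```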
```cpp
for (unsigned i = 0; i < plan->LoadCount; i++)
{
    const LoopVectorizationPlan::ScalarAccess& existing = plan->LoadAccesses[i];
    if ((existing.Address == access.Address) ||
        ((existing.BaseLocalIfKnown == access.BaseLocalIfKnown) &&
         (existing.OffsetLocalIfKnown == access.OffsetLocalIfKnown) &&
         (existing.IndexOffset == access.IndexOffset) && (existing.PostIVOffset == access.PostIVOffset) &&
         (existing.ElementType == access.ElementType) && (existing.IsArray == access.IsArray) &&
         (existing.IsByrefLocal == access.IsByrefLocal) &&
         (existing.IsByrefBaseWithOffset == access.IsByrefBaseWithOffset) &&
         (existing.IsByrefWithIndex == access.IsByrefWithIndex)))
    {
        *index = i;
        return true;
    }
}
```
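A simplified model of this "find an existing structurally identical load" scan, with the access key reduced to a (base local, constant offset, element size) triple. The struct and field names are assumptions for illustration, not the plan's real layout.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for the plan's ScalarAccess descriptor.
struct Access
{
    unsigned baseLocal;
    int      offset;
    unsigned elemSize;
};

// Linear search for a structurally identical access, mirroring the
// pass's scan over recorded loads; on a hit, reports the slot index.
inline bool FindExistingAccess(const std::vector<Access>& loads, const Access& a, size_t* index)
{
    for (size_t i = 0; i < loads.size(); i++)
    {
        const Access& e = loads[i];
        if ((e.baseLocal == a.baseLocal) && (e.offset == a.offset) && (e.elemSize == a.elemSize))
        {
            *index = i;
            return true;
        }
    }
    return false;
}
```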
```cpp
CONFIG_STRING(JitObjectStackAllocationTrackFieldsRange, "JitObjectStackAllocationTrackFieldsRange")
CONFIG_INTEGER(JitObjectStackAllocationDumpConnGraph, "JitObjectStackAllocationDumpConnGraph", 0)

RELEASE_CONFIG_INTEGER(JitAutoVectorization, "JitAutoVectorization", 1)
```
```cpp
class AutoVectorizer
{
public:
    explicit AutoVectorizer(Compiler* compiler);
```
```cpp
if (first.IsArray && second.IsArray)
{
    return true;
}

if ((first.IsByrefLocal || first.IsByrefBaseWithOffset || first.IsByrefWithIndex) &&
    (second.IsByrefLocal || second.IsByrefBaseWithOffset || second.IsByrefWithIndex))
{
    return true;
}

// Array and byref/span bases can still describe the same storage after morphing.
```
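As a standalone sketch of that conservative classification, collapsing the three byref flavors into a single flag (this is a model of the idea, not the pass's actual predicate):

```cpp
// Simplified may-alias classification: two accesses may alias when both
// are array accesses, both are byref-based, or the kinds are mixed,
// since array and byref/span bases can describe the same storage.
struct AccessKind
{
    bool isArray;
    bool isByref; // collapses the three byref flavors from the real code
};

inline bool MayAlias(const AccessKind& first, const AccessKind& second)
{
    if (first.isArray && second.isArray)
        return true;
    if (first.isByref && second.isByref)
        return true;
    // Mixed array/byref: conservatively assume possible overlap.
    return (first.isArray || first.isByref) && (second.isArray || second.isByref);
}
```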
```cpp
if (doAutoVectorization)
{
    // Rewrite HIR loops late, after VN-DSE and if-conversion but before rationalization.
    //
    DoPhase(this, PHASE_AUTO_VECTORIZATION, &Compiler::optAutoVectorize);
}
```
```cpp
if (tree->OperIs(GT_ADD))
{
    GenTree* op1 = tree->AsOp()->gtOp1;
    GenTree* op2 = tree->AsOp()->gtOp2;

    if (op1->IsCnsIntOrI())
    {
        *offset += static_cast<int>(op1->AsIntConCommon()->IconValue());
        return TryAnalyzeIndexExpr(plan, op2, ivLcl, offset, invariantLcl, sawIv, depth + 1);
    }

    if (op2->IsCnsIntOrI())
    {
        *offset += static_cast<int>(op2->AsIntConCommon()->IconValue());
        return TryAnalyzeIndexExpr(plan, op1, ivLcl, offset, invariantLcl, sawIv, depth + 1);
    }
}
```
```cpp
LclVarDsc* const ivDsc = m_compiler->lvaGetDesc(plan->InductionVar);
GenTree* iv = m_compiler->gtNewLclvNode(plan->InductionVar, ivDsc->TypeGet());
GenTree* end = m_compiler->gtCloneExpr(plan->End);

GenTree* lastLane = m_compiler->gtNewCastNode(TYP_LONG, iv, false, TYP_LONG);
if (plan->VectorizationFactor > 1)
{
    lastLane =
        m_compiler->gtNewOperNode(plan->Step < 0 ? GT_SUB : GT_ADD, TYP_LONG, lastLane,
                                  m_compiler->gtNewLconNode(static_cast<int64_t>(plan->VectorizationFactor - 1)));
}

end = m_compiler->gtNewCastNode(TYP_LONG, end, false, TYP_LONG);
```
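The widening in this hunk matters: computing `iv + (VF - 1)` in 32-bit arithmetic could wrap near `INT32_MAX`. A minimal standalone model of the 64-bit last-lane bound check (a hypothetical helper, not the pass's code):

```cpp
#include <cstdint>

// Check, in 64-bit arithmetic, that the last lane a vector iteration
// touches (iv + VF - 1 for a positive step, iv - (VF - 1) for a
// negative one) is still within the loop bound. Widening first avoids
// 32-bit wraparound on iv + (VF - 1).
inline bool VectorIterationInBounds(int32_t iv, int32_t end, int step, unsigned vf)
{
    int64_t lastLane = static_cast<int64_t>(iv);
    if (vf > 1)
    {
        lastLane += (step < 0) ? -static_cast<int64_t>(vf - 1) : static_cast<int64_t>(vf - 1);
    }
    const int64_t end64 = static_cast<int64_t>(end);
    return (step < 0) ? (lastLane >= end64) : (lastLane < end64);
}
```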
```cpp
BasicBlock* const header = loop->GetHeader();
bool alreadyRewritten = false;
for (unsigned rewrittenHeader : rewrittenHeaders)
{
    if (rewrittenHeader == header->bbNum)
    {
        alreadyRewritten = true;
        break;
    }
}
```
```cpp
if (!changed)
{
    m_compiler->fgInvalidateDfsTree();
```
```cpp
const LoopVectorizationPlan originalPlan = *plan;

for (unsigned i = 0; i < vectorSizeCount; i++)
{
    *plan = originalPlan;

    plan->VectorSizeBytes = vectorSizes[i];
```
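The snapshot/restore pattern in this hunk can be sketched independently: save a baseline plan, then reset to it before trying each candidate vector width so a rejected attempt leaves no residual state behind. The types, names, and acceptance callback below are illustrative.

```cpp
#include <cstddef>

// Toy stand-in for the vectorization plan's mutable state.
struct Plan
{
    unsigned vectorSizeBytes;
    bool     accepted;
};

// Try candidate widths in order, restoring the saved baseline before
// each attempt; accept the first width the callback approves.
template <typename TryFn>
inline bool PickVectorSize(Plan* plan, const unsigned* sizes, size_t count, TryFn tryPlan)
{
    const Plan baseline = *plan; // snapshot, as the real code does
    for (size_t i = 0; i < count; i++)
    {
        *plan = baseline; // discard state mutated by the previous attempt
        plan->vectorSizeBytes = sizes[i];
        if (tryPlan(plan))
        {
            plan->accepted = true;
            return true;
        }
    }
    *plan = baseline;
    return false;
}
```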
pmi on S.P.CoreLib and framework assemblies:
CoreLib and framework assemblies full diffs:

Method lists (potential candidates for us to vectorize in the BCL):

It seems there are some interesting spots that can be manually vectorized in the tensor and regex libraries.
The final TP impact seems to be +0.12% to +0.38% for fullopts.
Closing as I've got everything I was curious about in this experiment. |
Note
This is a fully vibe-coded experiment with neither careful correctness review nor extensive testing.
It is not intended for review or merging. I'm opening this PR to evaluate its actual impact and to find potential vectorization opportunities within the BCL.
Local SPMI Run Study
Headline:
The default policy has profitability checks for opportunity analysis and chooses the vector width based on vector pressure; the aggressive policy bypasses the checks and always uses the maximum available vector size.
Complete report with asm diffs:
autovec-binary-release-asm-metrics-report.md
Artifacts including spmi logs and per-method diffs:
autovec-binary-release-artifacts-with-dasm.zip
cc: @dotnet/jit-contrib
Note
The following content is AI generated.
Summary
This change adds a late HIR auto-vectorization phase to RyuJIT. The phase recognizes profitable counted loops, builds a virtual-lane SLP plan from the scalar loop body, and rewrites the loop into a vector loop plus scalar epilogue. The generated IR uses existing SIMD/HW intrinsic nodes so rationalization, lowering, LSRA, and codegen continue to own target-specific SIMD expansion.
The vectorizer is enabled by default via `JitAutoVectorization=1`. A second knob, `JitAggressiveVectorizing=1`, bypasses the profitability policy for investigation and opportunity measurement.

Phase Placement
The new phase is wired as:
More concretely, `PHASE_AUTO_VECTORIZATION` runs after VN-based dead-store removal and if-conversion, and before pre-layout flow opts and rationalization.

This placement is intentional:
- branchy code has already been if-converted into `GT_SELECT` and can be packed by SLP,
- the IR is still in `BasicBlock`/`Statement`/`GenTree` form and can be rewritten structurally.

After rewriting, the phase marks loop/flow/liveness-sensitive state stale and relies on the normal downstream pipeline to repair/consume the resulting HIR.
Design
The implementation is centered on `AutoVectorizer` in `jit/autovectorizer.cpp`.

The core pipeline is: analyze candidate loops, build an SLP plan, check legality and profitability, rewrite accepted loops, and increment `Metrics.LoopsVectorized` for each rewritten loop.

The SLP planner does not materialize scalar unrolling in HIR. Instead, it reasons about virtual lanes and directly emits vector IR for the accepted pack:
This keeps unsuccessful candidates cheap and avoids expanding scalar IR just to discover that the loop is not vectorizable.
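A minimal model of one such virtual-lane legality check: deciding whether VF scalar loads describe a single contiguous vector load purely from their descriptors, without materializing any unrolled scalar IR. The struct and names are assumptions for illustration.

```cpp
#include <cstddef>

// Illustrative lane descriptor: where one virtual lane's load comes from.
struct LaneLoad
{
    unsigned baseLocal;
    int      byteOffset;
};

// VF scalar loads form one vector load when they share a base and their
// byte offsets are consecutive elements of size elemSize.
inline bool FormsContiguousPack(const LaneLoad* lanes, size_t vf, unsigned elemSize)
{
    for (size_t i = 1; i < vf; i++)
    {
        if (lanes[i].baseLocal != lanes[0].baseLocal)
            return false;
        if (lanes[i].byteOffset != lanes[0].byteOffset + static_cast<int>(i * elemSize))
            return false;
    }
    return true;
}
```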
Supported Targets and Width Selection
The phase is enabled for optimized, non-debuggable compilations on SIMD-capable xarch and arm64 targets.
Vector width selection uses the maximum hardware-supported SIMD width for the selected element type, subject to the profitability policy.
The policy considers estimated scalar/vector cost, loop overhead, constant trip count, block hotness, simple memory-loop shape, vector pressure, reduction presence, and code size.
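As a sketch of the kind of trade-off such a policy weighs, the check below vectorizes only when the estimated vector-path cost (setup overhead plus vector iterations plus the scalar tail) beats the all-scalar cost. The cost weights and names are invented for illustration, not the pass's actual model.

```cpp
// Illustrative profitability check: compare the estimated cost of
// processing expectedTripCount elements scalar-only vs. vectorized
// with a scalar epilogue for the remainder.
inline bool IsProfitable(unsigned scalarCostPerIter,
                         unsigned vectorCostPerIter,
                         unsigned vf,
                         unsigned setupOverhead,
                         unsigned expectedTripCount)
{
    const unsigned scalarTotal = scalarCostPerIter * expectedTripCount;
    const unsigned vectorIters = expectedTripCount / vf;
    const unsigned tailIters   = expectedTripCount % vf;
    const unsigned vectorTotal =
        setupOverhead + vectorCostPerIter * vectorIters + scalarCostPerIter * tailIters;
    return vectorTotal < scalarTotal;
}
```

Short loops fail such a check naturally: the setup overhead dominates before any vector iteration can amortize it.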
`JitAggressiveVectorizing=1` bypasses this policy and selects the first legal vector width, which is useful for finding missed opportunities and comparing the production policy against the legal maximum.

Covered Loop Shapes
The vectorizer currently handles conservative natural-loop forms: counted loops with `<`, `<=`, `>`, `>=`, and selected `!=` loop tests.

The phase deliberately rejects unsupported or risky CFG shapes such as EH loops, non-innermost loops, multi-exit loops, and `==` loop termination.

Covered Memory Forms
The memory analysis supports contiguous element access through:
The vectorizer rejects volatile accesses, unsupported element types, remaining unproven bounds checks, unsupported address expressions, and dependence patterns that could change scalar semantics.
Covered Element Types and Operations
Supported element types include the primitive SIMD element types handled by the existing SIMD/HW intrinsic path, including integral and floating-point element types.
The SLP planner covers:
- `GT_SELECT`,

Reduction support includes vector accumulator setup, vector loop update, and scalar finalization. The implementation supports add/sub reductions and min/max-style reductions for supported element types, including floating-point reduction paths where the scalar semantics are represented by the recognized intrinsic pattern.
Unsupported forms are still rejected rather than guessed: non-contiguous/gather/scatter memory, arbitrary casts and widening/narrowing packs, modulo, unsupported division forms, unsupported helper/call shapes, complicated address expressions, and control flow that was not simplified into supported straight-line HIR.
Safety Model
The implementation is intentionally conservative. It rejects a candidate unless legality is clear.
Important safety rules include:
The rewrite preserves the original scalar loop for the tail and redirects control flow through the new vector loop only when the vector trip count and alias checks allow it.
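The gating described above can be sketched as a runtime guard (a minimal model assuming a unit-step up-counting loop and two possibly-overlapping operand ranges; the helper names are illustrative, not the pass's actual code):

```cpp
#include <cstddef>
#include <cstdint>

// Split the trip count: how many full vector iterations can run, and
// where the preserved scalar epilogue loop takes over.
inline void SplitTripCount(int64_t start, int64_t end, unsigned vf,
                           int64_t* vectorIters, int64_t* epilogueStart)
{
    const int64_t trips = (end > start) ? (end - start) : 0;
    *vectorIters   = trips / vf;
    *epilogueStart = start + (*vectorIters) * static_cast<int64_t>(vf);
}

// Runtime alias guard: enter the vector loop only when the two byte
// ranges provably do not overlap; otherwise stay on the scalar loop.
inline bool RangesDisjoint(const void* p, const void* q, size_t byteLen)
{
    const uintptr_t a = reinterpret_cast<uintptr_t>(p);
    const uintptr_t b = reinterpret_cast<uintptr_t>(q);
    return (a + byteLen <= b) || (b + byteLen <= a);
}
```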
Diagnostics and Metrics
The phase uses normal JitDump output. Dumps include:
This change also adds a new JIT metric, `LoopsVectorized`.

The metric increments once per successfully rewritten loop and can be used by SuperPMI metricdiff to measure vectorization coverage per collection/method.
Files Changed
- jit/autovectorizer.cpp
- jit/autovectorizer.h
- jit/compiler.cpp
- jit/compiler.h
- jit/compphases.h
- jit/jitconfigvalues.h
- jit/jitmetadatalist.h
- jit/CMakeLists.txt

Validation
- clr.jit Release with NoPgoOptimize=true.
- LoopsVectorized metrics for the final report.