@turbo loses last cleanup iteration with strided contiguous load when out_i mod W != 0 (Apple ARM) #570

## Reproducer (Apple M-series, Julia 1.11.8, LoopVectorization master, VectorizationBase 0.21.72)

\`\`\`julia
using LoopVectorization

function f!(out, arr)
  @turbo for j in axes(out, 2), i in axes(out, 1)
    out[i, j] = arr[2i, 2j]
  end
end

A = rand(6, 2)
out = fill(NaN, 3, 1)   # 3 i-iterations, 1 j-iteration
f!(out, A)
@show out                 # last entry remains NaN — the loop never wrote it
\`\`\`

The strided load on the contiguous axis (\`arr[2i, ...]\`) combined with \`@turbo\`'s default unroll on \`:i\` makes the cleanup tail skip the final iteration(s) on Apple ARM.
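
To confirm that the missing write comes from the vectorized kernel rather than from the index pattern itself, a plain scalar loop over the same indices can serve as a reference. This is only a sketch: f_ref! and count_unwritten are hypothetical helper names introduced here, not part of LoopVectorization.

```julia
# Scalar reference for the reproducer: same loop nest and index pattern, no @turbo.
function f_ref!(out, arr)
    for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
    return out
end

# Entries still holding the NaN sentinel were never written by the kernel.
count_unwritten(out) = count(isnan, out)

A = rand(6, 2)
out_ref = f_ref!(fill(NaN, 3, 1), A)
@show count_unwritten(out_ref)   # the scalar loop writes every entry

# With the reproducer's f! loaded, the same check can be applied to it:
#   out = fill(NaN, 3, 1); f!(out, A); count_unwritten(out)
```

Any nonzero count from the vectorized kernel on the same inputs points at the tail handling rather than at the data.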

## Pattern

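The number of dropped trailing \`out[i, j]\` writes is roughly \`out_i mod (unroll_factor * W)\` whenever that remainder is nonzero; in every case tabulated below it equals \`out_i mod W\`. Here \`out_i\` is the trip count of \`:i\` and \`M\` is the extent of \`arr\` along the strided axis, so \`out_i = M ÷ 2\`. Concretely, on aarch64+Apple (NEON 128-bit):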

**Float64 (W=2)**: bug when \`out_i\` is odd and ≥ 3 — last 1 iteration dropped.

| M | out_i | last entry written? |
|---|-------|---------------------|
| 4..5 | 2 | yes |
| 6..7 | 3 | **NO** (last entry stays at NaN) |
| 8..9 | 4 | yes |
| 10..11 | 5 | **NO** |
| 12..13 | 6 | yes |
| 14..15 | 7 | **NO** |

**Float32 (W=4)**: bug when \`out_i mod 4 ∈ {1, 2, 3}\` and out_i ≥ 4 — last \`out_i mod 4\` iterations dropped.

| M | out_i | dropped |
|---|-------|---------|
| 8..9 | 4 | 0 |
| 10..11 | 5 | 1 |
| 12..13 | 6 | 2 |
| 14..15 | 7 | 3 |
| 16..17 | 8 | 0 |
| 18..19 | 9 | 1 |
| 20..21 | 10 | 2 |
| 22..23 | 11 | 3 |
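
The two tables can be summarized by a one-line model. The sketch below is fitted to the observations above and is not derived from LoopVectorization's codegen; predicted_dropped is a hypothetical name, and the threshold out_i >= W is an assumption that happens to match both the "≥ 3" (Float64) and "≥ 4" (Float32) conditions.

```julia
# Hypothetical model fitted to the two tables above; not derived from LV's codegen.
# W is the SIMD width in elements (2 for Float64, 4 for Float32 on 128-bit NEON).
predicted_dropped(out_i, W) = out_i >= W ? out_i % W : 0

for (T, W) in ((Float64, 2), (Float32, 4))
    println(T, " (W=", W, ")")
    for out_i in 2:11
        println("  out_i = ", out_i, "  predicted dropped = ", predicted_dropped(out_i, W))
    end
end
```

If a measured count on another element type or width disagrees with this model, that would be a useful data point for narrowing down the faulty termination check.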

## Trigger conditions

- Apple aarch64 (NEON 128-bit register width).
- An axis with a **strided** load on the contiguous dimension, e.g. \`arr[2i, ...]\` or \`arr[3i, ...]\` or \`arr[2i-1, ...]\`.
- LV's default unroll on that axis (\`@turbo unroll=1\` and \`@turbo unroll=(1,1)\` both make the bug go away).
- Independent of the other axis: holds whether \`:j\` has 1 or 40 iterations, whether the body has 1 term or 4.

The non-strided baseline \`out[i,j] = arr[i, j]\` does **not** trigger the bug — only strided loads on the unrolled axis do.
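
The conditions above can be probed with a small sweep that runs a kernel over a range of shapes and counts the entries it never wrote. This is a sketch under stated assumptions: unwritten_counts and strided_copy_ref! are hypothetical names, and the scalar stand-in is there only so the block runs without LoopVectorization installed.

```julia
# Hypothetical sweep harness: `kernel!` is any function with the signature
# kernel!(out, arr) that is expected to fill every entry of `out`.
function unwritten_counts(kernel!, ::Type{T}; out_is = 2:11, nj = 1) where {T}
    results = Pair{Int,Int}[]
    for out_i in out_is
        arr = rand(T, 2out_i, 2nj)       # large enough for arr[2i, 2j]
        out = fill(T(NaN), out_i, nj)    # NaN marks "never written"
        kernel!(out, arr)
        push!(results, out_i => count(isnan, out))
    end
    return results
end

# Scalar stand-in so the sketch runs without LoopVectorization.
function strided_copy_ref!(out, arr)
    for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
    return out
end

for T in (Float64, Float32)
    println(T, " => ", unwritten_counts(strided_copy_ref!, T))
end
```

Passing the reproducer's f! in place of the stand-in should regenerate the tables in the Pattern section, and varying nj exercises the claim that the other axis does not matter.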

## Test gating

This is the cluster of \`@test_broken\`s in \`test/shuffleloadstores.jl\` around line 494 (the \`tullio_issue_131\` pattern with \`(j+1) % 4 ∈ (2, 3) && (j+1) ≥ 6\`). On the underlying access pattern: \`(j+1) = M\`, and the failures collapse exactly to \`out_i = M÷2\` being odd and ≥ 3.
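
The claimed equivalence between the test-gating predicate and the Float64 failure condition can be checked by brute force. In this sketch, gated and float64_fails are hypothetical names; the first transcribes the predicate quoted above with M = j + 1, the second transcribes "out_i odd and ≥ 3" with out_i = M ÷ 2.

```julia
# Predicate gating the @test_broken cases, written in terms of M = j + 1.
gated(M) = M % 4 in (2, 3) && M >= 6

# The Float64 failure condition from the Pattern section, with out_i = M ÷ 2.
float64_fails(M) = isodd(M ÷ 2) && M ÷ 2 >= 3

# The two predicates agree on every M in a generous range.
@assert all(gated(M) == float64_fails(M) for M in 1:1000)
```

The agreement follows because M mod 4 ∈ {2, 3} is the same as M ÷ 2 being odd, and M ≥ 6 is the same as M ÷ 2 ≥ 3.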

## Workarounds for users

- \`@turbo unroll=1 for ...\`: disables unrolling. Correct for every shape I tested.
- \`@turbo unroll=(2, 2) for ...\`: unrolls both axes by 2. Also correct in my tests.
- \`@turbo unroll=(1, 4) for ...\`: unrolls only the non-strided axis. This does **not** fix it on this access pattern; the same cleanup-tail bug appears.

## Likely fix area

\`src/codegen/lowering.jl\` lines ~363-425 (\`unsigned(Ureduct) < unsigned(UF)\` branch that generates the unroll cleanup), and/or \`terminatecondition\` at \`src/codegen/loopstartstopmanager.jl:1378\`. The cleanup termination check appears to use \`UF\` instead of \`1\` for the final scalar phase when the unrolled axis is contiguous + has strided access. I have not isolated the exact off-by-one yet.

## Context

Discovered while working on the SciML small grant to get the LoopVectorization tests green on macOS ARM (companion PRs JuliaSIMD/LoopVectorization.jl#569 and JuliaSIMD/VectorizationBase.jl#127, which fix the W=1 nested VecUnroll store and the BitVector dynamic-index load issues, respectively).

Filing this so the remaining \`@test_broken\` in \`shuffleloadstores.jl\` has a concrete pointer to the bug.
