@turbo loses last cleanup iteration with strided contiguous load when out_i mod W != 0 (Apple ARM) #570

@ChrisRackauckas

Reproducer (Apple M-series, Julia 1.11.8, LoopVectorization master, VectorizationBase 0.21.72)

```julia
using LoopVectorization

function f!(out, arr)
    @turbo for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
end

A = rand(6, 2)
out = fill(NaN, 3, 1) # 3 i-iterations, 1 j-iteration
f!(out, A)
@show out # last entry remains NaN — the loop never wrote it
```

The strided load on the contiguous axis (`arr[2i, ...]`) combined with `@turbo`'s default unroll on `:i` makes the cleanup tail skip the final iteration(s) on Apple ARM.

Pattern

The number of dropped trailing `out[i, j]` writes is roughly `out_i mod (unroll_factor * W)` whenever that remainder is nonzero; in the measurements below it matches `out_i mod W`. Concretely, on aarch64+Apple (NEON 128-bit):

Float64 (W=2): bug when `out_i` is odd and ≥ 3 — last 1 iteration dropped.

| M | out_i | last entry written? |
|---|---|---|
| 4..5 | 2 | yes |
| 6..7 | 3 | NO (last entry stays at NaN) |
| 8..9 | 4 | yes |
| 10..11 | 5 | NO |
| 12..13 | 6 | yes |
| 14..15 | 7 | NO |

Float32 (W=4): bug when `out_i mod 4 ∈ {1, 2, 3}` and out_i ≥ 4 — last `out_i mod 4` iterations dropped.

| M | out_i | dropped |
|---|---|---|
| 8..9 | 4 | 0 |
| 10..11 | 5 | 1 |
| 12..13 | 6 | 2 |
| 14..15 | 7 | 3 |
| 16..17 | 8 | 0 |
| 18..19 | 9 | 1 |
| 20..21 | 10 | 2 |
| 22..23 | 11 | 3 |
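
The two tables can be summarized by a small model. This is only a sketch of the observed pattern, not of what LoopVectorization does internally: `predicted_dropped` is a hypothetical helper, and the `out_i > W` threshold is an assumption taken from the "≥ 3" and "≥ 4" conditions stated above.

```julia
# Assumed model of the observations: nothing is dropped until out_i exceeds W,
# after which the remainder out_i mod W is lost from the tail.
predicted_dropped(out_i, W) = out_i > W ? out_i % W : 0

# Print the predicted number of dropped writes for the shapes covered above.
for (T, W) in ((Float64, 2), (Float32, 4))
    println(T, " (W=", W, "): ",
            [out_i => predicted_dropped(out_i, W) for out_i in 2:11])
end
```

It needs no packages, so it can be compared against measured counts on any machine.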

Trigger conditions

  • Apple aarch64 (NEON 128-bit register width).
  • An axis with a strided load on the contiguous dimension, e.g. `arr[2i, ...]` or `arr[3i, ...]` or `arr[2i-1, ...]`.
  • LV's default unroll on that axis (`@turbo unroll=1` and `@turbo unroll=(1,1)` both make the bug go away).
  • Independent of the other axis: holds whether `:j` has 1 or 40 iterations, whether the body has 1 term or 4.

The non-strided baseline `out[i,j] = arr[i, j]` does not trigger the bug — only strided loads on the unrolled axis do.
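
A sweep along the following lines can be used to compare the strided and non-strided kernels side by side. It is a minimal sketch: `strided!`, `contiguous!`, and `unwritten` are hypothetical names introduced here, and the input shape `(2out_i, 2)` is just one choice that keeps every index in bounds. On an affected machine, nonzero counts should appear only in the `strided` column.

```julia
using LoopVectorization

# Strided load on the contiguous axis, as in the reproducer.
function strided!(out, arr)
    @turbo for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
    return out
end

# Non-strided baseline.
function contiguous!(out, arr)
    @turbo for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[i, j]
    end
    return out
end

# Count the entries still holding the NaN sentinel after running `kernel!`.
function unwritten(kernel!, ::Type{T}, out_i) where {T}
    arr = rand(T, 2out_i, 2)      # rand never produces NaN
    out = fill(T(NaN), out_i, 1)
    kernel!(out, arr)
    return count(isnan, out)
end

for T in (Float64, Float32), out_i in 2:11
    println(T, " out_i=", out_i,
            " strided=", unwritten(strided!, T, out_i),
            " contiguous=", unwritten(contiguous!, T, out_i))
end
```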

Test gating

This is the cluster of `@test_broken`s in `test/shuffleloadstores.jl` around line 494 (the `tullio_issue_131` pattern with `(j+1) % 4 ∈ (2, 3) && (j+1) ≥ 6`). On the underlying access pattern: `(j+1) = M`, and the failures collapse exactly to `out_i = M÷2` being odd and ≥ 3.
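
The collapse of the test condition can be checked mechanically. In this sketch `gate` and `collapsed` are hypothetical names: `gate` transcribes the `@test_broken` condition with `M = j+1`, and `collapsed` is the "odd and ≥ 3" form with `out_i = M÷2`.

```julia
# Condition quoted from the test file, written in terms of M = j + 1.
gate(M) = M % 4 in (2, 3) && M >= 6

# Equivalent statement on out_i = M ÷ 2.
collapsed(M) = isodd(M ÷ 2) && M ÷ 2 >= 3

# The two predicates agree for every positive M checked.
@assert all(gate(M) == collapsed(M) for M in 1:1000)
```

The equivalence holds because `M % 4 ∈ (2, 3)` is exactly the case where `M ÷ 2` is odd, and `M ≥ 6` is exactly `M ÷ 2 ≥ 3`.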

Workarounds for users

  • `@turbo unroll=1 for ...` — disable unrolling on the unrolled axis. Tested correct for all shapes I tried.
  • `@turbo unroll=(2, 2) for ...` — cross-unroll both axes by 2. Also tested correct.
  • `@turbo unroll=(1, 4)` (only unroll the non-strided axis) does not fix it on this access pattern — same cleanup tail bug appears.
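
Applied to the reproducer, the first workaround looks roughly like this. `f_unroll1!` is a hypothetical name; the only change from `f!` is the `unroll=1` argument.

```julia
using LoopVectorization

# Reproducer kernel with unrolling disabled via `unroll=1`.
function f_unroll1!(out, arr)
    @turbo unroll=1 for j in axes(out, 2), i in axes(out, 1)
        out[i, j] = arr[2i, 2j]
    end
    return out
end

A = rand(6, 2)
out = f_unroll1!(fill(NaN, 3, 1), A)
@show count(isnan, out)   # entries the loop never wrote
```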

Likely fix area

`src/codegen/lowering.jl` lines ~363-425 (`unsigned(Ureduct) < unsigned(UF)` branch that generates the unroll cleanup), and/or `terminatecondition` at `src/codegen/loopstartstopmanager.jl:1378`. The cleanup termination check appears to use `UF` instead of `1` for the final scalar phase when the unrolled axis is contiguous + has strided access. I have not isolated the exact off-by-one yet.

Context

Discovered while investigating the SciML small grant for getting LoopVectorization tests green on macOS ARM (companion PRs #569 and JuliaSIMD/VectorizationBase.jl#127, which fix the W=1 nested VecUnroll store and BitVector dynamic-index load issues respectively).

Filing this so the remaining `@test_broken` in `shuffleloadstores.jl` has a concrete pointer to the bug.
