
merge main into amd-staging #3067

Merged
ronlieb merged 69 commits into amd-staging from amd/merge/upstream_merge_20260625114633
Jun 25, 2026
Conversation

@ronlieb ronlieb commented Jun 25, 2026

Collaborator

No description provided.

statham-arm and others added 30 commits June 25, 2026 11:48
…vm#205356)

@llvm.fmuladd is the IR intrinsic that leaves it up to code generation
whether to fuse an FP multiply+add pair or leave them separate.
Generally you only fuse them if fused mul+add has good performance.

On AArch64, for the float and double instances of this intrinsic, isel
was unconditionally fusing the operations. This is sensible with
hardware FP, but a bad idea for the rare case of AArch64 without
hardware FP, because that leads to a call to the libm `fma()` or
`fmaf()` function. That function generally (in multiple libcs) seems to
be much slower than separate mul+add operations. So this patch checks
for the presence of FP before reporting that fusing the operations is a
performance win.
closes llvm#203649 
- I have added a check in `libc/src/wchar/wcslcat.cpp` to prevent the
overflow that occurs when the `static_cast` wraps the limit.
- For the `wcsncat` implementation, I have fixed the condition in the for
loop to first check that `i` is within bounds, preventing out-of-bounds
access on `s2`.
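
The ordering issue in the `wcsncat` loop can be sketched as follows. This is a minimal illustration, not the LLVM libc source: `bounded_wcsncat` is a hypothetical name, and the destination is assumed to have room for the appended characters plus a terminator.

```cpp
#include <cstddef>

// Hypothetical helper, not the LLVM libc implementation: appends at most n
// wide characters from s2 to the end of s1 and null-terminates the result.
wchar_t *bounded_wcsncat(wchar_t *s1, const wchar_t *s2, std::size_t n) {
  std::size_t len = 0;
  while (s1[len] != L'\0')
    ++len;
  std::size_t i = 0;
  // Test `i < n` before reading s2[i]. With the operands swapped, s2[n] would
  // be read even when s2 holds exactly n characters and no terminator.
  for (; i < n && s2[i] != L'\0'; ++i)
    s1[len + i] = s2[i];
  s1[len + i] = L'\0';
  return s1;
}
```

Because `&&` short-circuits, putting the bounds test first is enough to keep the read of `s2[i]` in range.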

I am new to the codebase so any feedback would be very helpful and I
will be happy to follow up promptly after a review!
…lvm#186166)

Currently, the kernel type (i.e. `generic`, `spmd`, `spmd-no-loop` and
`bare`) of an `omp.target` operation is not an explicit attribute of the
operation. Rather, this is inferred based on the contents of its region
and clauses.

The problems with this approach are that the check can be resource
intensive for large kernels, and that misidentifications are likely
when arbitrary operations from other dialects are present.

Since the AST already contains the information needed to identify the
kernel type in a more reliable manner, this patch moves that
responsibility to the Flang frontend. Other MLIR passes that create
`omp.target` operations are updated as well.

One known limitation of this approach is that the MLIR op verifier for
`omp.target` can't completely check that the contents of its region are
compatible with the declared kernel type without being exposed to the
same pattern-matching limitations that this patch is removing. Also, the
`TargetOp::getInnermostCapturedOmpOp()` function is maintained but,
ideally, a better solution should be implemented to remove its expensive
and potentially flaky checks from MLIR.
Combined OpenMP constructs, such as `parallel do`, which represent nests
of constructs where each one contains a single other construct without
any other directives or statements in between, are currently not marked
in any way in the MLIR representation.

This works because they don't usually require any specific handling
other than what would be done for the included operations. However, the
handling of `target` regions needs to know whether it was part of a
combined construct in order to properly optimize for the SPMD case and
detect when certain clauses must be unconditionally evaluated on the
host.

So far, this has been achieved by having some MLIR pattern-matching
logic to infer whether a nest of operations could have potentially been
produced for a combined construct. This approach is error prone,
computationally expensive and it can't really work in the general case.
On the other hand, a compiler frontend can easily tell the difference
and tag MLIR operations accordingly.

This patch extends the `ComposableOpInterface` of the OpenMP dialect to
handle a new `omp.combined` attribute that must be set for all leaf
constructs (except for the innermost one) of a combined construct.
Verification logic is added for this interface, which is added to all
operations that can be used as part of a combined construct, and the
previous `target`-related pattern-matching logic is removed.

This patch has to be followed up with Flang lowering changes to pass all
unit tests.
This patch adds the `omp.combined` attribute to OpenMP dialect
operations following changes to the `ComposableOpInterface`.

This attribute is added to operations representing non-innermost leaf
constructs of a combined construct and to standalone block-associated
constructs that can be combined with their parent construct.

Changes are made to the OpenMP lowering logic, as well as the
do-concurrent, workshare and workdistribute transformation passes.
Given that XORs are associative, an XOR on `vgf2p8affineqb`'s source can
be reassociated to occur after the instruction by first applying the
matrix to the XOR operand. If the XOR operand is an 8-bit splat, it can
be applied for free by combining it with the immediate. This patch:

- Folds XOR by splat on `vgf2p8affineqb`'s source into its immediate.
- Only occurs when the matrix is both constant and splat across each
64-bit lane.
- Can occur when the XOR is multi-use, as it can still reduce the
dependency chain.
- Includes test coverage for a variety of matrices and negative cases
for when the fold isn't possible.
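
The identity behind the fold can be checked with a scalar model of one byte lane. This is a sketch under stated assumptions: the bit ordering follows Intel's published pseudocode for the instruction, and `Matrix`, `affineByte`, `foldXorIntoImm` and `foldIsEquivalent` are illustrative names, not LLVM APIs.

```cpp
#include <cstdint>

// Scalar model of one byte lane of the affine transform:
// result bit i = parity(matrix row[7 - i] & x) ^ imm bit i.
struct Matrix {
  std::uint8_t row[8];
};

std::uint8_t affineByte(const Matrix &m, std::uint8_t x, std::uint8_t imm) {
  std::uint8_t result = 0;
  for (int i = 0; i < 8; ++i) {
    std::uint8_t v = m.row[7 - i] & x;
    // Fold the byte down to its parity bit.
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    result |= static_cast<std::uint8_t>((v & 1) << i);
  }
  return result ^ imm;
}

// The immediate that absorbs an XOR of the source with the splat byte `c`:
// the matrix applied to c, XORed with the original immediate.
std::uint8_t foldXorIntoImm(const Matrix &m, std::uint8_t c, std::uint8_t imm) {
  return affineByte(m, c, imm);
}

// Checks affine(x ^ c, imm) == affine(x, foldedImm) for every source byte x.
bool foldIsEquivalent(const Matrix &m, std::uint8_t c, std::uint8_t imm) {
  const std::uint8_t folded = foldXorIntoImm(m, c, imm);
  for (int x = 0; x < 256; ++x) {
    const std::uint8_t src = static_cast<std::uint8_t>(x);
    if (affineByte(m, src ^ c, imm) != affineByte(m, src, folded))
      return false;
  }
  return true;
}
```

The instruction takes a single immediate for all bytes, which is why the fold needs both the XOR operand and the matrix to be splats: only then is the folded immediate the same in every lane.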

Fixes llvm#179606
Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
Adds a tablegen pattern to select BLSMSK i8 for 
```
  %neg = sub i8 %x, 1
  %and = xor i8 %neg, %x
```
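
For reference, the matched expression has a simple scalar equivalent. The sketch below uses a hypothetical helper name, `blsmsk8`, and only mirrors the arithmetic of the IR above.

```cpp
#include <cstdint>

// Scalar equivalent of the matched IR: x ^ (x - 1) sets every bit up to and
// including the lowest set bit of x, which is the mask BLSMSK computes.
std::uint8_t blsmsk8(std::uint8_t x) {
  return static_cast<std::uint8_t>(x ^ (x - 1));
}
```

For example, `0b00101000` maps to `0b00001111`, and an input of zero yields all ones in the 8-bit result.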

I used Claude to generate the comment line before the tablegen entry and the decoding in the .ll file, which I confirmed after running llc.

Fixes llvm#204984
This patch was reverted due to triggering another bug. That bug has been
fixed by llvm#205275, so this
should be ready to land now.

Original commit message:

This should make assumes a bit more efficient, since it removes a few
instructions. This should also help with optimizations that are limited
in how many instructions they step through.

This reverts commit 053d75c.
…lvm#205623)

Currently CHERICapabilityFormatBase does not provide a definition for
getAlignmentMask, but does provide a declaration, which leads to
warnings when building with MSVC. We want to have an abstract base here
without any dynamic dispatch, which is what CRTP is for, so use it for
getAlignmentMask such that the base can provide a definition that uses
each derived type's implementation, just as the two base wrappers were
already doing when calling getAlignmentMask. Whilst doing this we might
as well move the wrappers to the header so they can be inlined (and now
that getAlignmentMask is defined we can use it in the helpers rather
than needing each of them to explicitly use the derived type).
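
The CRTP arrangement described above can be sketched roughly like this. All names (`FormatBase`, `ToyFormat`, `getAlignmentMaskImpl`, `getRequiredAlignment`) and the alignment rule are made up for illustration; the actual CHERI Support classes may be structured differently.

```cpp
#include <cstdint>

// Illustrative names only; these are not the CHERI Support classes.
template <typename Derived> struct FormatBase {
  // The base supplies a definition that statically forwards to the derived
  // type, so no virtual dispatch is involved.
  std::uint64_t getAlignmentMask(std::uint64_t length) const {
    return static_cast<const Derived *>(this)->getAlignmentMaskImpl(length);
  }

  // Wrapper defined in the header so it can be inlined; it calls the base's
  // getAlignmentMask instead of naming the derived type itself.
  std::uint64_t getRequiredAlignment(std::uint64_t length) const {
    return ~getAlignmentMask(length) + 1;
  }
};

// A made-up format: lengths above 4 KiB need 16-byte alignment, else 1 byte.
struct ToyFormat : FormatBase<ToyFormat> {
  std::uint64_t getAlignmentMaskImpl(std::uint64_t length) const {
    return length > 4096 ? ~std::uint64_t{15} : ~std::uint64_t{0};
  }
};
```

Since every member of the base is defined, a compiler has no declared-but-undefined function to warn about, and the call resolves at compile time.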

Fixes: 7dc09d0 ("[CHERI] Add a Support utility for determining
alignment requirements of CHERI capabilities. (llvm#197402)")
…vm#205734)

This wasn't checking the codegen result, so move it to the right place
and use -verify instead of FileChecking stderr.

Co-authored-by: Claude (Opus 4.8) <noreply@anthropic.com>
…IR to core dialects (llvm#205483)

See the previous PR here:
llvm#164562

It was reverted by @lforg37 because of some build bot issue: see
llvm#164562 (comment).

However, after checking on my end, I could not reproduce the buildbot
issue. Since the problem was triggered in `flang`, which is completely
unrelated to this work, I assume that it was a builder or flaky-test
problem, so I'm re-opening this PR as it had been initially merged.

---------

Signed-off-by: Ferdinand Lemaire <flemairen6@gmail.com>
Co-authored-by: Ferdinand Lemaire <ferdinand.lemaire@woven-planet.global>
Co-authored-by: Ferdinand Lemaire <flemairen6@gmail.com>
…on (llvm#202121)

This patch mainly fixes a bug with parsing of unknown doxygen commands
in function parameter documentation.

To extract the parameter documentation from the function documentation,
the whole function documentation is parsed first.
Then the documentation paragraph for the requested parameter is
"converted" to a string and stored as the documentation for the
parameter. The string is converted by visiting and dumping all chunks of
the parsed paragraph.

When unknown doxygen commands are parsed (during the function
documentation parsing step), they are registered in a
`clang::comments::CommandTraits` object.
Visiting the unknown command requires querying the registered commands
through the `clang::comments::CommandTraits` object to get the command
name.

The bug was that the function documentation parsing step and the visiting
step used two different `clang::comments::CommandTraits` objects. Hence
the visiting step failed (out-of-bounds array access) when trying to
retrieve the command names for unknown commands.

The patch moves the function documentation parsing step to the
construction of the `SymbolDocCommentVisitor` which is also responsible
for converting the parameter documentation paragraph to a string.
This way the same `clang::comments::CommandTraits` is used and the query
for unknown command names is correct.

Additional fixes:

- correct some whitespace behaviour for doxygen inline commands
- add a new token kind for the clang comment parser to distinguish
unknown "backslash" and "at" commands to correctly show them in the
clangd hover info

Related issue: clangd/clangd#2671
This adds a `noipa` function attribute to LLVM IR. This new attribute
disables any interprocedural analysis that inspects the definition of
the function. Setting this attribute is equivalent to moving the
function definition to a separate, optimizer-opaque, module.

The `noipa` attribute does *not* control inlining or outlining. Add the
`noinline` and `nooutline` attributes as well in cases where inlining
and outlining should additionally be disabled.

Revival of https://reviews.llvm.org/D101011
Discussed in https://discourse.llvm.org/t/noipa-continues/74411

LLVM portion of llvm#40819
Fixes the false positive in
llvm#122934

memcpy is allowed to bypass strict aliasing rules (see
https://en.cppreference.com/c/string/byte/memcpy) so we shouldn't alter
shadow memory when it is used
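
As a reminder of the idiom in question, here is a minimal sketch of type punning through `memcpy`. The helper name `floatBits` is hypothetical, and the example assumes `float` is a 32-bit IEEE 754 type.

```cpp
#include <cstdint>
#include <cstring>

// Copies the object representation of a float into an integer. Because the
// bytes go through memcpy, this does not violate strict aliasing, unlike
// reading the float through a reinterpret_cast'ed integer pointer.
std::uint32_t floatBits(float value) {
  static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return bits;
}
```

Code of this shape is well defined, so a tool that tracks the types of memory should not report it.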
…rguments. NFC (llvm#205748)

Currently if you want to use match_fn over a range of VPValues, you have
to explicitly write `match_fn<VPValue>` otherwise it will resolve to the
VPUser overload.

This changes the functor to be a lambda with an auto argument so
match_fn(...) works for both VPValues and VPUsers without explicit
templates. The lambda is inlined so there's no indirect function call.
vputils::getGEPFlagsForPtr is updated to use the new form.

We can't use `bind_back` since it requires we bind to exactly one
function that's known at call time.
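
The functor change can be illustrated with a small stand-alone sketch. `Value`, `User`, `OpcodeIs` and `matchFn` are stand-ins invented for this example; they are not the VPlan pattern-matching classes.

```cpp
#include <algorithm>
#include <vector>

// Stand-ins for two unrelated argument types; not the VPlan classes.
struct Value { int opcode; };
struct User { int opcode; };

// Hypothetical pattern object with one overload per argument type.
struct OpcodeIs {
  int expected;
  bool match(const Value *v) const { return v->opcode == expected; }
  bool match(const User *u) const { return u->opcode == expected; }
};

// Returns a generic lambda: the argument type is deduced at each call site, so
// callers never have to spell a template argument.
template <typename Pattern> auto matchFn(Pattern p) {
  return [p](const auto *arg) { return p.match(arg); };
}

bool anyValueHasOpcode(const std::vector<Value *> &values, int opcode) {
  return std::any_of(values.begin(), values.end(), matchFn(OpcodeIs{opcode}));
}
```

The same `matchFn(...)` expression works over a range of `Value *` or `User *`, because overload resolution happens inside the lambda body once the argument type is known.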
…ns` opt (llvm#205764)

Annotation suggestions are expected to fire very often, and they have
shown significant regressions since
llvm#204045. This change gates the
suggestions behind a dedicated `SuggestAnnotations` option, preventing
unnecessary work when the relevant diagnostics are disabled.
In `ExprEngine::processCallExit` step 3 may theoretically split the
state because it calls `removeDead`, which activates `LiveSymbols` and
`DeadSymbols` callbacks of various checkers. (However, in practice it is
likely that these checker callbacks never actually split the state -- at
least, no such state splits happen in the LIT tests.)

The nodes produced by `removeDead` are placed in the set `CleanedNodes`;
in theory the different execution paths should be handled in parallel,
independently of each other. However, the loop `for (ExplodedNode *N :
CleanedNodes)` contained an early return statement, which meant that if
the creation of `CEENode` failed for a node `N`, then the subsequent
iterations were skipped altogether.

This commit replaces the `return` with a `continue` to ensure that the
nodes in `CleanedNodes` are handled independently (if there are several
such nodes).
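
The difference between the two control-flow choices can be shown with a toy loop. This is only a model of the pattern: `processAll` is a hypothetical function, and a negative id stands in for a node whose follow-up node could not be created.

```cpp
#include <vector>

// Toy model: each entry is a node id, and a negative id stands for a node
// whose follow-up node could not be created. Names are illustrative.
std::vector<int> processAll(const std::vector<int> &cleanedNodes) {
  std::vector<int> handled;
  for (int node : cleanedNodes) {
    if (node < 0)
      continue; // an early `return` here would drop every later node
    handled.push_back(node);
  }
  return handled;
}
```

With `continue`, a failure for one node leaves the remaining nodes unaffected, which is the independence the commit message describes.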

This logic error has been present in the codebase since 2012 (!) when
commit 7e53bd6 introduced the `removeDead`
step into `processCallExit`.

Given that nobody noticed this error within the last 14 years, I very
strongly suspect that it doesn't have any observable functional effects,
i.e. this change is essentially NFC.
…vm#202377)

Updates NaryReassociatePass with a safety check to guard against GEPs
into arrays with zero-sized element types (e.g. `[0 x ptr]`) to prevent
division by zero.
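
The kind of guard involved can be sketched as below. `indexFromOffset` is a hypothetical helper written for this example and does not reflect the pass's actual code.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: converts a byte offset into an element index, refusing
// when the element size is zero (as for [0 x ptr]) or does not divide evenly.
std::optional<std::int64_t> indexFromOffset(std::int64_t byteOffset,
                                            std::uint64_t elementSize) {
  if (elementSize == 0)
    return std::nullopt; // guard: dividing by elementSize would be undefined
  const auto size = static_cast<std::int64_t>(elementSize);
  if (byteOffset % size != 0)
    return std::nullopt;
  return byteOffset / size;
}
```

Returning an empty result lets the caller simply skip the transformation for such types.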
llvm#205715)

The watched-literal solver has a few invariant checks that run on every
solver iteration in assertion builds. Some of these checks rebuild and
iterate over the watched-literal state. This overhead is usually hidden,
but it becomes dominant for large flow-sensitive analyses.

While testing clang-tidy's `unchecked-optional-access` check on
real-world projects (in this case, LLVM itself), we found a few
extremely slow analyses caused by this overhead.

| Time    | File                                                |
|---------|-----------------------------------------------------|
| 8235.7s | llvm-project/clang/utils/TableGen/RISCVVEmitter.cpp |
| 8197.2s | llvm-project/clang/lib/Driver/Multilib.cpp          |

(Measured on a 32-core Ice Lake machine with 128 GB of memory.)

After moving these asserts to `EXPENSIVE_CHECKS`, the same files
complete in about 14.2 seconds and 12.2 seconds locally. That is roughly
a 580x improvement for `RISCVVEmitter.cpp` and a 673x improvement for
`Multilib.cpp`.

This can also affect clang-tidy pre-merge CI, because the pre-merge
configuration uses an assertions build and enables
`bugprone-unchecked-optional-access`.

Given this scale of improvement, I think these invariant checks are
better suited for `EXPENSIVE_CHECKS`. They remain available in dedicated
expensive-check builds, while avoiding a very large cost in regular
release+assertions builds.
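
The gating pattern looks roughly like the sketch below. The function names and the invariant are invented for illustration; only the `EXPENSIVE_CHECKS` macro name comes from the description above.

```cpp
#include <cassert>
#include <vector>

// Illustrative invariant whose cost is linear in the size of the state.
bool watchedLiteralsConsistent(const std::vector<int> &watched) {
  for (int lit : watched)
    if (lit == 0)
      return false;
  return true;
}

// One solver step. The invariant check is compiled in only when
// EXPENSIVE_CHECKS is defined, so regular assertion builds skip it.
int solveStep(std::vector<int> &watched) {
#ifdef EXPENSIVE_CHECKS
  assert(watchedLiteralsConsistent(watched));
#endif
  watched.push_back(static_cast<int>(watched.size()) + 1);
  return static_cast<int>(watched.size());
}
```

A linear check repeated on every iteration adds up quickly for large analyses, which is the cost that moving it behind the macro avoids.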

Closes llvm#205713
Updates to the kernel type detection logic now allow `target parallel
do` to be promoted to SPMD-No-Loop.

A currently broken offload test that was affected by this change is
updated here.
Ensure libgen.h is included in TARGET_PUBLIC_HEADERS for Linux targets
so that it gets generated and installed.

Assisted-by: Automated tooling, human reviewed.
My previous testing regarding this seems to have been insufficient. Or
this regressed some time along the way.

Now that `CLANG_USE_EXPERIMENTAL_CONST_INTERP` is used for testing I
noticed a few regressions.

We need to special-case the evaluating decl in a few places, since it's
a global variable that we're allowed to modify.
…lvm#205805)

It looks like there is still a bug with removing assumes from the
assumption cache.

Reverts llvm#205773
…re` and `gather/scatter` ops (llvm#204842)

Extend negative stride checks to MaskedLoadOp, MaskedStoreOp, GatherOp,
and ScatterOp to match LoadOp and StoreOp behavior.

Depends on: llvm#204611.

AI Disclaimer: I used AI for the tests.

---------

Signed-off-by: Federico Bruzzone <federico.bruzzone.i@gmail.com>
…lvm#205518)

The latency and throughput for these instructions don't match what's in
the A510 Software Optimization Guide, so adjust them so that they do
match. Also rearrange the definitions to match how they're structured in
the optimization guide and rename things in a similar manner to how the
C1 CPUs do things, as it's much clearer.
This fixes 5314be5.

Signed-off-by: Ingo Müller <ingomueller@google.com>
Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
Added the POSIX unsetenv() function and its internal support.

Implemented EnvironmentManager::unset() to remove a variable by name,
free the string if allocated, and compact the array.

Updated EnvironmentManager to synchronize the public global environ
pointer when transitioning to managed storage.

Registered for x86_64, aarch64, and riscv. Integration tests cover basic
operations and edge cases.
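
The remove-and-compact step can be sketched over a plain null-terminated array. `unsetAndCompact` is a hypothetical helper, not `EnvironmentManager::unset()`, and it leaves out freeing the removed strings because ownership is not modeled here.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper over a null-terminated array of "NAME=value" strings.
// Removes every entry for `name`, compacts the array in place, and returns
// the number of entries that remain.
std::size_t unsetAndCompact(const char **env, const char *name) {
  const std::size_t nameLen = std::strlen(name);
  std::size_t out = 0;
  for (std::size_t in = 0; env[in] != nullptr; ++in) {
    // An entry matches only if the name is followed directly by '='.
    const bool matches = std::strncmp(env[in], name, nameLen) == 0 &&
                         env[in][nameLen] == '=';
    if (!matches)
      env[out++] = env[in];
  }
  env[out] = nullptr; // keep the array null-terminated
  return out;
}
```

Checking for the `=` right after the name keeps `A` from also removing a variable such as `AB`.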

Assisted-by: Automated tooling, human reviewed.
slinder1 and others added 22 commits June 25, 2026 11:21
It seems like using a non-`hidden` `toctree` for page navigation is a
bit of a trap, in that every doc must have a single unique path through
the global toctree to the root doc, and it is very easy to end up with
multiple.

This patch tries to address the warnings (actually infos, hence why it
does not fail the build) in llvm/docs/.

I tried to preserve the documents as-is, by hiding `toctree`s and
instead using lists of `{doc}` forms where the `toctree` was visible
before.

The only visual change in the resulting HTML is that the link is now
underlined where it wasn't before.

I also nested the `Tutorials` section in the GISel Porting document, and
didn't link to it directly as the title is a bit ambiguous without the
context of the document it appears in.

I also saw warnings about a jump in heading level in
`llvm-debuginfo-analyzer/README.md` and assumed it was just a mistake,
so I collapsed the level-3 headings down to level-2.

Finally, I wrote a Sphinx extension to turn ambiguous toctree entries
into errors, so the docs do not regress. I hope to fix the other Sphinx
projects in llvm-project and enable the checks for them too, assuming
this patch is accepted.

Change-Id: Icb11de69be1ea5489fba501aee4d767f5129e7e1
SelectionDAG can fold a symbol address (a kernel parameter, global
variable, or external symbol) directly into a memory instruction's
address operand, but only within a single basic block. When the address
crosses a block boundary, ISel materializes it with `MOV_B{32,64}_sym`
and the memory instruction becomes register-relative:

```ptx
mov.b64       %rd1, kernel_param_0;
ld.param.b64  %rd2, [%rd1];
ld.param.b64  %rd3, [%rd1+8];
```

instead of:

```ptx
ld.param.b64  %rd2, [kernel_param_0];
ld.param.b64  %rd3, [kernel_param_0+8];
```

This patch adds NVPTXAddressFolder, a pre-regalloc pass that looks for
loads and stores whose address operand is defined by `MOV_B{32,64}_sym`,
then folds the symbol back into the memory operand. The mov is erased
once it has no remaining uses; if the address also feeds arithmetic or
escapes, it is kept.

To make this generic over NVPTX memory instructions, the patch enables
named operand tables for NVPTX instructions and uses them to find `addr`
and `addsp` operands instead of hardcoding opcode-specific operand
indices.

This was motivated by a CUDA.jl performance regression where `byval`
kernel parameters stopped being pre-lowered into simple loads and
exposed the missing cross-block fold in the backend.

Disclaimer: LLMs (GPT 5.5, Opus 4.8) were used to develop this PR.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
llvm#205806)

`llvm_anyptr_ty` should only allow scalar pointer types and disallow
vector of pointers; fix the vector constraint for `llvm_anyptr_ty`
accordingly. This fixes a regression in `llvm_anyptr_ty` that was
introduced in llvm#203506.

Added a unit test to verify that use of a vector of pointers for
`llvm_anyptr_ty` fails verification.
…load/store` and `gather/scatter` ops" (llvm#205832)

Reverts llvm#204842. That PR breaks the following two
tests:

* `mlir/test/Integration/Dialect/SparseTensor/CPU/reshape_dot.mlir.test`
* `mlir/test/Integration/Dialect/SparseTensor/CPU/sparse_coo_test.mlir.test`
This fixes f36745e.

Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
llvm#205726)

Example:
```fortran
subroutine sub(a, n)
  real(8) :: a(n, *)
  integer :: n, i
  !$acc data no_create(a)
  !$acc parallel loop
  do i = 1, n
    a(1, i) = 0.0d0
  end do
  !$acc end data
end subroutine
```
An assumed-size dummy array (e.g. `real(8) :: a(n,*)`) has an unknown
trailing extent and is passed without a descriptor. When the OpenACC
implicit-data pass builds a data clause for such an array,
`generateSeqTyAccBounds` enters the unknown-shape branch but finds no
descriptor (`fir.box`) to recover bounds from, and hits:

`assert(false && "array with unknown dimension expected to have descriptor");`

The premise is wrong: an assumed-size array legitimately has no
descriptor and no recoverable bounds.

Fix: return empty bounds instead of asserting. The caller only assigns
bounds when non-empty, so the array is mapped without bounds — the only
correct option when the extent is unknown, and sufficient for
presence-only clauses (`no_create`/`present`).
…205727)

Example:
```fortran
!$acc routine worker
subroutine transform(p, n)
  real*8 p(*); integer n
  !$acc loop seq
  do i1 = 1, n
    !$acc loop seq
    do i2 = 1, n
      ! ... a dozen+ levels of nested acc loops ...
      !$acc loop vector
      do i = 1, n
        p(i) = p(i) + 1.0d0
      end do
    end do
  end do
end subroutine
```

In this code, the routine becomes a `func.func` with deeply nested
orphan `acc.loop` ops. `acc-specialize-for-host` lowers them (e.g.
`acc.loop` → `scf.for`) via `applyPatternsGreedily`, but leaves
`GreedyRewriteConfig::maxIterations` at its default of 10. Since inner
loops only become rewritable after their parents convert, a nest deeper
than 10 isn't at a fixed point when the cap is hit, so the driver
returns `failure()` and the pass calls `signalPassFailure()` — a
spurious, diagnostic-less failure even though the conversion was
progressing correctly.

Fix:
Run the rewrite to convergence instead of stopping at the default cap:
```cpp
config.setMaxIterations(GreedyRewriteConfig::kNoLimit);
```
The patterns are strictly reductive (ops are lowered/erased, never
regenerated), so this is safe. Adds a regression test with a 16-deep
orphan `acc.loop` nest.
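
Why a deep nest runs into the cap can be seen in a toy model. This is not the MLIR greedy driver: `lowerNest` is a made-up function that lowers one nesting level per sweep and uses a negative limit to mean "no limit".

```cpp
#include <vector>

// Toy model of a loop nest: lowered[i] is true once loop i has been lowered.
// A loop only becomes rewritable after its parent has been lowered, and each
// sweep lowers at most one loop (the outermost one that is ready).
// Returns true if a fixed point is reached within maxIterations sweeps;
// maxIterations < 0 means no limit.
bool lowerNest(int depth, int maxIterations) {
  std::vector<bool> lowered(depth, false);
  for (int iter = 0; maxIterations < 0 || iter < maxIterations; ++iter) {
    bool changed = false;
    for (int i = 0; i < depth; ++i) {
      const bool parentDone = (i == 0) || lowered[i - 1];
      if (!lowered[i] && parentDone) {
        lowered[i] = true;
        changed = true;
        break; // the child only becomes ready in the next sweep
      }
    }
    if (!changed)
      return true; // converged
  }
  return false; // cap hit before convergence
}
```

In this model a 16-deep nest fails with a cap of 10 even though every sweep makes progress, and succeeds once the cap is lifted.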
…ching (llvm#205802)

The previous code resolved a `construct={...}` selector by mapping its
name to a trait through a string lookup
(`getOpenMPContextTraitSelectorKind` ->
`getOpenMPContextTraitPropertyForSelector`). That works for a standalone
leaf construct such as `parallel`, whose name matches a single construct
selector string, but not for a combined/composite directive such as
`target teams`: its name (`"target teams"`) matches no selector string,
so the lookup returns `invalid` and the selector's construct traits are
dropped.

This PR adds `AppendConstructTraitsForDirective` that walks the
directive sets and adds each leaf trait, so combined directives are no
longer reduced to a single (or dropped) trait. It also handles the
standalone `dispatch` construct trait.
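
The failure mode and the fix can be modeled with plain strings. This sketch splits the directive name on whitespace, which is an assumption made for brevity; the actual change walks the directive sets. `isConstructSelector` and `constructTraitsFor` are hypothetical names, and the selector list is illustrative.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical selector table; the real trait enum lives in LLVM's OpenMP
// support code and is not reproduced here.
bool isConstructSelector(const std::string &name) {
  return name == "target" || name == "teams" || name == "parallel" ||
         name == "for" || name == "simd" || name == "dispatch";
}

// Splits a directive name into its leaf constructs and keeps each leaf that
// names a construct selector, instead of looking up the whole name at once.
std::vector<std::string> constructTraitsFor(const std::string &directive) {
  std::vector<std::string> traits;
  std::istringstream words(directive);
  std::string leaf;
  while (words >> leaf)
    if (isConstructSelector(leaf))
      traits.push_back(leaf);
  return traits;
}
```

Looking up `"target teams"` as a single string finds nothing, while the per-leaf walk yields both `target` and `teams`.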

Fixes llvm#205664

Assisted-by: Cursor
emitNewArrayInitializer hit errorNYI for a value-initialized array new of
a trivially-constructible element (new T[n]()), where the trailing
parentheses mean zero-initialization.  Route the trivial-ctor branch
through the existing tryMemsetInitialization() helper, following classic
CodeGen: a zero-initializable element gets operator new[] plus a single
memset to 0, while a non-zero-initializable element (an array of
pointers-to-data-member, whose null value is -1) declines the memset and
falls through to the constructor loop value-initializing each element.
Also add the getParent()->isEmpty() early return.  tryMemsetInitialization
builds the memset through the Address overload of createMemSet so the
destination alignment is preserved.
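
The source-level behavior being lowered is shown in the short example below. The helper names and the `Point` type are made up; the remark about the null representation of a pointer to data member assumes the Itanium C++ ABI.

```cpp
#include <cstddef>
#include <memory>

struct Point {
  int x;
  int y;
};
using MemberPtr = int Point::*;

// `new T[n]()` value-initializes each element; for int that means zero, so a
// single memset to 0 over the allocation gives the same result.
bool intsAreZero(std::size_t n) {
  std::unique_ptr<int[]> values(new int[n]());
  for (std::size_t i = 0; i < n; ++i)
    if (values[i] != 0)
      return false;
  return true;
}

// Value-initialization also yields null pointers to data members, but under
// the Itanium C++ ABI their bit pattern is -1, so a memset to 0 would be wrong
// and each element has to be initialized individually.
bool memberPointersAreNull(std::size_t n) {
  std::unique_ptr<MemberPtr[]> members(new MemberPtr[n]());
  for (std::size_t i = 0; i < n; ++i)
    if (members[i] != nullptr)
      return false;
  return true;
}
```

Both functions return true on a conforming compiler; the difference is only in how the zero or null values must be materialized.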

Found building the SPEC CPU 2026 LLVM benchmark (723.llvm_r / 823.llvm_s)
with ClangIR.
Read the validated `driver-tools` build setting directly in
`generate_driver_tools_def` instead of reconstructing tool names from
selected `CcInfo` dependency labels through `label.name`.
`generate_driver_selects` now returns the selected dependency labels, so
`select_driver_tools` is removed.

This fixes a silent issue when two multicall participants have the same
`label.name`.
…unit tests (llvm#205449)

Moves isStdInitializerList from misc/ExplicitConstructorCheck into
utils::type_traits so it can be shared, and adds unit tests for it. NFC.
…alization (llvm#201967)

Address a bug pointed out by @bjope (thank you!)

- To perform struct to vector canonicalization, it is not enough that
the struct layout size is the same as the vector layout size, because
structs and vectors may have padding in different locations! Previously
we would promote `{ i5, i5 }` as `<i5, i5>`, which is a miscompile!

- I also relaxed another requirement. Previously we made sure that the
struct layout size is equal to the vector allocation size. This
prevented promoting `{ i32, i32, i32 }` as `<i32, i32, i32>` because the
struct layout size is 3 x i32 but the vector allocation size is 4 x i32.
So instead we should compare to the vector store size, which is 3 x i32.
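
The two points above come down to simple size arithmetic. The sketch uses a simplified layout model that is an assumption of this example, not the `DataLayout` API: vector elements are bit-packed, struct fields each start on a byte boundary, and the vector allocation size is the store size rounded up to a power of two.

```cpp
#include <cstdint>

// Bytes needed to store a bit-packed vector of `count` elements.
std::uint64_t vectorStoreSizeBytes(unsigned elemBits, unsigned count) {
  return (std::uint64_t{elemBits} * count + 7) / 8;
}

// Allocation size: the store size rounded up to a power of two, standing in
// for the vector's ABI alignment.
std::uint64_t vectorAllocSizeBytes(unsigned elemBits, unsigned count) {
  const std::uint64_t size = vectorStoreSizeBytes(elemBits, count);
  std::uint64_t rounded = 1;
  while (rounded < size)
    rounded *= 2;
  return rounded;
}

// Bit offset of element `index` in a bit-packed vector.
std::uint64_t vectorElemBitOffset(unsigned elemBits, unsigned index) {
  return std::uint64_t{elemBits} * index;
}

// Bit offset of field `index` in a struct of identical byte-aligned fields.
std::uint64_t structFieldBitOffset(unsigned elemBits, unsigned index) {
  const std::uint64_t fieldBytes = (elemBits + 7) / 8;
  return fieldBytes * 8 * index;
}
```

Under this model `{ i5, i5 }` and `<i5, i5>` both occupy 2 bytes, yet the second element sits at bit 8 in the struct and bit 5 in the vector. Three `i32` elements have a 12-byte store size but a 16-byte allocation size.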
Document the built-in compatibility for different GCC versions since
GCC-5
…apture under `if` (llvm#205731)

Example:
```fortran
!$acc parallel if(cond)
!$acc atomic capture
  a = a + 1
  b = a
!$acc end atomic
!$acc end parallel
```

In this code, the `if` clause triggers host-fallback specialization. A
blanket `acc.terminator` erase pattern strips the implicit terminator of
the still-present `acc.atomic.capture`, so the later `getTerminator()`
trips `mightHaveTerminator()`.

Fix: remove that pattern — every ACC region op already erases its own
terminator when it unwraps/inlines, so it was redundant (and only raced
ahead to break this case). Adds a lit test.
…lvm#205738)

Add pre-commit tests for scalar OR/UMax reductions whose result is
only used by an eq/ne-zero comparison.
Check for the evaluating decl in `GetRefGlobal` so we don't fail too
early. We also need to mark the `APValue` as constexpr-unknown when
returning it, even though the backing `Descriptor` is not marked
constexpr-unknown.



This fixes the last differences in `constant-expression-p2280r4.cpp`.
The pass emits remarks describing `acc.firstprivate` and `acc.private`
associated with OpenACC compute and loop constructs.

Assisted-by: Claude Code
…#205701)

When a DW_TAG_subprogram has a DW_AT_linkage_name that is not actually a
mangled name and differs from DW_AT_name, lldb used the linkage name as
the function's display name. For C++ the linkage name demangles back to
the source name, but a plain symbol such as __main_argc_argv does not,
so the function showed up under its raw linkage name in backtraces,
breakpoint locations and `image lookup`, and was not findable by its
source name.

This happens on WebAssembly: wasi-libc renames `int main(int, char**)`
to its __main_argc_argv argv-passing wrapper, keeping DW_AT_name "main"
but recording DW_AT_linkage_name "__main_argc_argv".

When the linkage name has no recognized mangling scheme, use DW_AT_name
as the display name and keep the linkage name as the symbol, so lookups
by either name still resolve.
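
The selection rule can be sketched as follows. `looksMangled` and `displayName` are hypothetical helpers; the mangling test here recognizes only the Itanium `_Z` prefix, whereas lldb knows several schemes.

```cpp
#include <string>

// Rough stand-in for a mangling check: only the Itanium "_Z" prefix is
// recognized here.
bool looksMangled(const std::string &name) {
  return name.rfind("_Z", 0) == 0;
}

// Picks the name to display for a function: use the linkage name only when it
// is actually mangled, otherwise fall back to the DW_AT_name value.
std::string displayName(const std::string &dwName,
                        const std::string &linkageName) {
  if (linkageName.empty() || !looksMangled(linkageName))
    return dwName;
  return linkageName;
}
```

With `DW_AT_name` set to `main` and a linkage name of `__main_argc_argv`, this rule displays `main`, while the linkage name can still be kept as the symbol.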
@ronlieb ronlieb requested review from a team, dpalermo and skganesan008 June 25, 2026 17:19
@ronlieb ronlieb requested a review from kuhar as a code owner June 25, 2026 17:19
 __ocml_fma_f16    __ocml_exp_f16   __ocml_exp2_f16
 __ocml_exp10_f1   __ocml_log2_f16  __ocml_log_f16
 __ocml_log10_f16  __ocml_sqrt_f16
@ronlieb ronlieb merged commit ecc4d9d into amd-staging Jun 25, 2026
111 of 120 checks passed
@ronlieb ronlieb deleted the amd/merge/upstream_merge_20260625114633 branch June 25, 2026 23:19