merge main into amd-staging#3067
Merged
…vm#205356) @llvm.fmuladd is the IR intrinsic that leaves it up to code generation whether to fuse an FP multiply+add pair or leave them separate. Generally you only fuse them if fused mul+add has good performance. On AArch64, for the float and double instances of this intrinsic, isel was unconditionally fusing the operations. This is sensible with hardware FP, but a bad idea for the rare case of AArch64 without hardware FP, because that leads to a call to the libm `fma()` or `fmaf()` function. That function generally (in multiple libcs) seems to be much slower than separate mul+add operations. So this patch checks for the presence of FP before reporting that fusing the operations is a performance win.
Closes llvm#203649
- I have added a check in `libc/src/wchar/wcslcat.cpp` to prevent the overflow that occurs when the `static_cast` wraps the limit.
- For the `wcsncat` implementation, I have fixed the condition in the for loop to first check that `i` is within bounds, preventing OOB access on `s2`.

I am new to the codebase, so any feedback would be very helpful and I will be happy to follow up promptly after a review!
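The loop-condition ordering described for `wcsncat` can be sketched as follows. This is a minimal illustration and not the LLVM libc source; the function name `bounded_wcsncat` is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper, not the LLVM libc implementation: appends at most n
// wide characters from s2 to the string in s1 and always null-terminates.
wchar_t *bounded_wcsncat(wchar_t *s1, const wchar_t *s2, std::size_t n) {
  std::size_t len = 0;
  while (s1[len] != L'\0')
    ++len;
  std::size_t i = 0;
  // Test `i < n` first: short-circuit evaluation then guarantees that s2[i]
  // is only read while i is still within the caller's limit.
  for (; i < n && s2[i] != L'\0'; ++i)
    s1[len + i] = s2[i];
  s1[len + i] = L'\0';
  return s1;
}
```

With the operands in the opposite order, `s2[i]` would be read once at `i == n`, which is out of bounds when `s2` holds exactly `n` characters and no terminator.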
…lvm#186166) Currently, the kernel type (i.e. `generic`, `spmd`, `spmd-no-loop` and `bare`) of an `omp.target` operation is not an explicit attribute of the operation. Rather, this is inferred based on the contents of its region and clauses. The problems with this approach are that it can be a potentially resource intensive check for large kernels, and misidentifications are prone to happen based on the presence of arbitrary operations from other dialects. Since the AST already contains the information needed to identify the kernel type in a more reliable manner, this patch moves that responsibility to the Flang frontend. Other MLIR passes that create `omp.target` operations are updated as well. One known limitation of this approach is that the MLIR op verifier for `omp.target` can't completely check that the contents of its region are compatible with the declared kernel type without being exposed to the same pattern-matching limitations that this patch is removing. Also, the `TargetOp::getInnermostCapturedOmpOp()` function is maintained but, ideally, a better solution should be implemented to remove its expensive and potentially flaky checks from MLIR.
Combined OpenMP constructs, such as `parallel do`, which represent nests of constructs where each one contains a single other construct without any other directives or statements in between, are currently not marked in any way in the MLIR representation. This works because they don't usually require any specific handling other than what would be done for the included operations. However, the handling of `target` regions needs to know whether it was part of a combined construct in order to properly optimize for the SPMD case and detect when certain clauses must be unconditionally evaluated in the host. So far, this has been achieved by having some MLIR pattern-matching logic to infer whether a nest of operations could have potentially been produced for a combined construct. This approach is error prone, computationally expensive and it can't really work in the general case. On the other hand, a compiler frontend can easily tell the difference and tag MLIR operations accordingly. This patch extends the `ComposableOpInterface` of the OpenMP dialect to handle a new `omp.combined` attribute that must be set for all leaves (except for the innermost one) of a combined construct. Verification logic is added for this interface, which is added to all operations that can be used as part of a combined construct, and the previous `target`-related pattern-matching logic is removed. This patch has to be followed up with Flang lowering changes to pass all unit tests.
This patch adds the `omp.combined` attribute to OpenMP dialect operations following changes to the `ComposableOpInterface`. This attribute is added to operations representing non-innermost leaf constructs of a combined construct and to standalone block-associated constructs that can be combined with their parent construct. Changes are made to the OpenMP lowering logic, as well as the do-concurrent, workshare and workdistribute transformation passes.
Given that XORs are associative, a XOR on `vgf2p8affineqb`'s source can be reassociated to occur after the transform by first permuting the XOR operand by the matrix. If the XOR operand is an 8-bit splat, it can be applied for free by combining it with the immediate.

This patch:
- Folds XOR by splat on `vgf2p8affineqb`'s source into its immediate.
- Only occurs when the matrix is both constant and splat across each 64-bit lane.
- Can occur when the XOR is multi-use, as it can still reduce the dependency chain.
- Includes test coverage for a variety of matrices and negative cases for when the fold isn't possible.

Fixes llvm#179606
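The fold works because an affine transform over GF(2) distributes over XOR: `A*(x ^ c) ^ imm` equals `A*x ^ (A*c ^ imm)`. The scalar model below is a sketch of that identity for a single byte, not the patch's code. The function names are hypothetical, and the row-to-bit ordering is simplified relative to the real instruction, which does not affect the identity.

```cpp
#include <cstdint>

// XOR of all bits of a byte.
std::uint8_t parity8(std::uint8_t v) {
  v ^= v >> 4;
  v ^= v >> 2;
  v ^= v >> 1;
  return v & 1;
}

// Simplified per-byte model of an affine transform over GF(2):
// result bit i = parity(rows[i] & x), then the whole byte is XORed with imm.
std::uint8_t affine_byte(const std::uint8_t (&rows)[8], std::uint8_t x,
                         std::uint8_t imm) {
  std::uint8_t r = 0;
  for (int i = 0; i < 8; ++i)
    r |= static_cast<std::uint8_t>(
        parity8(static_cast<std::uint8_t>(rows[i] & x)) << i);
  return static_cast<std::uint8_t>(r ^ imm);
}

// Hypothetical helper: computes the immediate that absorbs a XOR by the
// constant byte c applied to the source before the transform.
std::uint8_t fold_xor_into_imm(const std::uint8_t (&rows)[8], std::uint8_t c,
                               std::uint8_t imm) {
  return static_cast<std::uint8_t>(imm ^ affine_byte(rows, c, 0));
}
```

In this model, `affine_byte(rows, x ^ c, imm)` and `affine_byte(rows, x, fold_xor_into_imm(rows, c, imm))` agree for every `x` and `c`, which is why the XOR can disappear into the immediate when `c` is a known splat and the matrix is constant.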
Signed-off-by: Tikhomirova, Kseniya <kseniya.tikhomirova@intel.com>
Adds a tablegen pattern to select BLSMSK i8 for
```
%neg = sub i8 %x, 1
%and = xor i8 %neg, %x
```
I've used Claude to generate the comment line before the tablegen entry and the ll file decoding which I confirmed after llc

Fixes llvm#204984
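For reference, the IR pattern above computes `x ^ (x - 1)`, which sets every bit up to and including the lowest set bit of `x`. A small sketch of the same computation in C++ (the function name `blsmsk8` is only illustrative):

```cpp
#include <cstdint>

// Mask of all bits up to and including the lowest set bit of x.
// For x == 0 the subtraction wraps, so the result is 0xFF.
constexpr std::uint8_t blsmsk8(std::uint8_t x) {
  return static_cast<std::uint8_t>(x ^ (x - 1));
}

static_assert(blsmsk8(0x28) == 0x0F, "0b00101000 -> 0b00001111");
static_assert(blsmsk8(0x01) == 0x01, "lowest bit already set");
static_assert(blsmsk8(0x00) == 0xFF, "zero input yields all ones");
```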
This patch was reverted due to triggering another bug. That bug has been fixed by llvm#205275, so this should be ready to land now. Original commit message: This should make assumes a bit more efficient, since it removes a few instructions. This should also help with optimizations that are limited in how many instructions they step through. This reverts commit 053d75c.
…lvm#205623) Currently CHERICapabilityFormatBase does not provide a definition for getAlignmentMask, but does provide a declaration, which leads to warnings when building with MSVC. We want to have an abstract base here without any dynamic dispatch, which is what CRTP is for, so use it for getAlignmentMask such that the base can provide a definition that uses each derived type's implementation, just as the two base wrappers were already doing when calling getAlignmentMask. Whilst doing this we might as well move the wrappers to the header so they can be inlined (and now that getAlignmentMask is defined we can use it in the helpers rather than needing each of them to explicitly use the derived type). Fixes: 7dc09d0 ("[CHERI] Add a Support utility for determining alignment requirements of CHERI capabilities. (llvm#197402)")
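The CRTP shape described here can be sketched as below. All names (`CapFormatBase`, `ToyFormat`, `getAlignmentMaskImpl`, `getRequiredAlignment`) and the alignment rule are hypothetical; only the pattern of a base class defining a function in terms of the derived type's implementation, with no virtual dispatch, mirrors the description.

```cpp
#include <cstdint>

// Hypothetical CRTP base: provides a real definition of getAlignmentMask
// that statically forwards to the derived type, so nothing is left declared
// but undefined and no vtable is involved.
template <typename Derived> struct CapFormatBase {
  std::uint64_t getAlignmentMask(std::uint64_t length) const {
    return static_cast<const Derived *>(this)->getAlignmentMaskImpl(length);
  }

  // A helper in the base can now call getAlignmentMask directly instead of
  // naming the derived type itself.
  std::uint64_t getRequiredAlignment(std::uint64_t length) const {
    return ~getAlignmentMask(length) + 1;
  }
};

// Toy derived format with a made-up rule: lengths of 4 KiB or more need
// 16-byte alignment, everything else is byte-aligned.
struct ToyFormat : CapFormatBase<ToyFormat> {
  std::uint64_t getAlignmentMaskImpl(std::uint64_t length) const {
    return length >= 4096 ? ~std::uint64_t{15} : ~std::uint64_t{0};
  }
};
```

Because both functions live in a header-only template, calls such as `ToyFormat{}.getRequiredAlignment(8192)` can be fully inlined.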
…vm#205734) This wasn't checking the codegen result, so move it to the right place and use -verify instead of FileChecking stderr. Co-authored-by: Claude (Opus 4.8) <noreply@anthropic.com>
…IR to core dialects (llvm#205483) See the previous PR here: llvm#164562 It was reverted by @lforg37 because of some build bot issue: see llvm#164562 (comment). However, after checking on my end, I could not reproduce the buildbot issue. Seeing that the problem triggered in `flang` which is completely unrelated to this work, I assume that it was a builder or a flaky test problem so I'm re-opening this PR as it had been initially merged. --------- Signed-off-by: Ferdinand Lemaire <flemairen6@gmail.com> Co-authored-by: Ferdinand Lemaire <ferdinand.lemaire@woven-planet.global> Co-authored-by: Ferdinand Lemaire <flemairen6@gmail.com>
…on (llvm#202121) This patch mainly fixes a bug with parsing of unknown doxygen commands in function parameter documentation. To extract the parameter documentation from the function documentation, the whole function documentation is parsed first. Then the documentation paragraph for the requested parameter is "converted" to a string and stored as the documentation for the parameter. The string is converted by visiting and dumping all chunks of the parsed paragraph. When unknown doxygen commands are parsed (during the function documentation parsing step), they are registered in a `clang::comments::CommandTraits` object. Visiting an unknown command requires querying the registered commands through the `clang::comments::CommandTraits` object to get the command name. The bug was that the function documentation parsing step and the visiting step used two different `clang::comments::CommandTraits` objects. Hence the visiting step failed (array access out of bounds) when trying to retrieve the command names for unknown commands. The patch moves the function documentation parsing step to the construction of the `SymbolDocCommentVisitor`, which is also responsible for converting the parameter documentation paragraph to a string. This way the same `clang::comments::CommandTraits` is used and the query for unknown command names is correct.

Additional fixes:
- correct some whitespace behaviour for doxygen inline commands
- add a new token kind for the clang comment parser to distinguish unknown "backslash" and "at" commands to correctly show them in the clangd hover info

Related issue: clangd/clangd#2671
This adds a `noipa` function attribute to LLVM IR. This new attribute disables any interprocedural analysis that inspects the definition of the function. Setting this attribute is equivalent to moving the function definition to a separate, optimizer-opaque, module. The `noipa` attribute does *not* control inlining or outlining. Add the `noinline` and `nooutline` attributes as well in cases where inlining and outlining should additionally be disabled. Revival of https://reviews.llvm.org/D101011 Discussed in https://discourse.llvm.org/t/noipa-continues/74411 LLVM portion of llvm#40819
Fixes the false positive in llvm#122934 memcpy is allowed to bypass strict aliasing rules (see https://en.cppreference.com/c/string/byte/memcpy) so we shouldn't alter shadow memory when it is used
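As an illustration of the kind of code this covers, the snippet below reads the bytes of a `float` through `memcpy` rather than through a cast pointer, which is the well-defined way to reinterpret object representations. The function name is only illustrative.

```cpp
#include <cstdint>
#include <cstring>

// Copies the object representation of a float into an integer of the same
// size. memcpy works on bytes, so no strict-aliasing rule is violated.
std::uint32_t float_bits(float f) {
  static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof bits);
  return bits;
}
```

On a target where `float` is IEEE-754 binary32, `float_bits(1.0f)` yields `0x3f800000`.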
…rguments. NFC (llvm#205748) Currently if you want to use match_fn over a range of VPValues, you have to explicitly write `match_fn<VPValue>` otherwise it will resolve to the VPUser overload. This changes the functor to be a lambda with an auto argument so match_fn(...) works for both VPValues and VPUsers without explicit templates. The lambda is inlined so there's no indirect function call. vputils::getGEPFlagsForPtr is updated to use the new form. We can't use `bind_back` since it requires we bind to exactly one function that's known at call time.
…ns` opt (llvm#205764) Annotation suggestions expectedly fire very often, and they have recently shown significant regressions after llvm#204045. This now gates the suggestions behind a dedicated `SuggestAnnotations` option, preventing unnecessary work when the relevant diagnostics are disabled.
In `ExprEngine::processCallExit` step 3 may theoretically split the state because it calls `removeDead`, which activates `LiveSymbols` and `DeadSymbols` callbacks of various checkers. (However, in practice it is likely that these checker callbacks never actually split the state -- at least, no such state splits happen in the LIT tests.) The nodes produced by `removeDead` are placed in the set `CleanedNodes`; in theory the different execution paths should be handled in parallel, independently of each other. However, the loop `for (ExplodedNode *N : CleanedNodes)` contained an early return statement, which meant that if the creation of `CEENode` failed for a node `N`, then the subsequent iterations were skipped altogether. This commit replaces the `return` with a `continue` to ensure that the nodes in `CleanedNodes` are handled independently (if there are several such nodes). This logic error is present in the codebase since 2012 (!) when commit 7e53bd6 introduced the `removeDead` step into `processCallExit`. Given that nobody noticed this error within the last 14 years, I very strongly suspect that it doesn't have any observable functional effects, i.e. this change is essentially NFC.
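The difference between the two control-flow choices can be shown with a toy loop. This is a sketch only: `process_all` and its "negative values fail" rule are hypothetical stand-ins for the per-node work in `processCallExit`.

```cpp
#include <vector>

// Hypothetical stand-in for the per-node loop: a negative input models a
// node for which the follow-up node could not be created.
std::vector<int> process_all(const std::vector<int> &nodes) {
  std::vector<int> out;
  for (int n : nodes) {
    if (n < 0)
      continue; // an early `return out;` here would drop every later node
    out.push_back(n * 2);
  }
  return out;
}
```

With `continue`, the input `{1, -1, 3}` still produces results for both the first and the third element; with an early return, the third would be silently skipped.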
…vm#202377) Updates NaryReassociatePass with a safety check to guard against GEPs into arrays with zero sized element types (eg. [0 x ptr]) to prevent division by zero.
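A guard of this kind might look like the sketch below. It is not the NaryReassociatePass code; `index_for_offset` is a hypothetical helper that shows why an element size of zero has to be rejected before any division or remainder is computed.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: converts a byte offset into an element index, or
// returns nullopt when that is not possible.
std::optional<std::int64_t> index_for_offset(std::int64_t byteOffset,
                                             std::uint64_t elemSize) {
  // Zero-sized element types (e.g. the element of [0 x ptr]) would make
  // both operations below divide by zero, so bail out first.
  if (elemSize == 0)
    return std::nullopt;
  const auto size = static_cast<std::int64_t>(elemSize);
  if (byteOffset % size != 0)
    return std::nullopt;
  return byteOffset / size;
}
```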
llvm#205715) The watched-literal solver has a few invariant checks that run on every solver iteration in assertion builds. Some of these checks rebuild and iterate over the watched-literal state. This overhead is usually hidden, but it becomes dominant for large flow-sensitive analyses.

While testing clang-tidy's `unchecked-optional-access` check on real world projects (in this case, LLVM itself), we found there are a few extremely slow analyses caused by this overhead.

| Time    | File                                                |
|---------|-----------------------------------------------------|
| 8235.7s | llvm-project/clang/utils/TableGen/RISCVVEmitter.cpp |
| 8197.2s | llvm-project/clang/lib/Driver/Multilib.cpp          |

(Ran on a machine with Icelake 32cores + 128gb memory)

After moving these asserts to `EXPENSIVE_CHECKS`, the same files complete in about 14.2 seconds and 12.2 seconds locally. That is roughly a 580x improvement for `RISCVVEmitter.cpp` and a 673x improvement for `Multilib.cpp`.

This can also affect clang-tidy pre-merge CI, because the pre-merge configuration uses an assertions build and enables `bugprone-unchecked-optional-access`.

Given this scale of improvement, I think these invariant checks are better suited for `EXPENSIVE_CHECKS`. They remain available in dedicated expensive-check builds, while avoiding a very large cost in regular release+assertions builds.

Closes llvm#205713
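The general pattern of gating a costly invariant check behind `EXPENSIVE_CHECKS` can be sketched as follows. The solver state and the invariant are hypothetical stand-ins; only the preprocessor guard reflects what the commit describes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical invariant that is linear in the size of the solver state,
// so running it on every step is quadratic overall.
bool watchlists_consistent(const std::vector<int> &watched) {
  for (int lit : watched)
    if (lit < 0)
      return false;
  return true;
}

// Hypothetical solver step. The invariant is only evaluated when the build
// defines EXPENSIVE_CHECKS, so ordinary assertion builds skip the scan.
int solver_step(std::vector<int> &watched, int next) {
  watched.push_back(next);
#ifdef EXPENSIVE_CHECKS
  assert(watchlists_consistent(watched) && "watched-literal invariant broken");
#endif
  return static_cast<int>(watched.size());
}
```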
Updates to the kernel type detection logic now allow `target parallel do` to be promoted to SPMD-No-Loop. A currently broken offload test that was affected by this change is updated here.
Ensure libgen.h is included in TARGET_PUBLIC_HEADERS for Linux targets so that it gets generated and installed. Assisted-by: Automated tooling, human reviewed.
My previous testing regarding this seems to have been insufficient. Or this regressed some time along the way. Now that `CLANG_USE_EXPERIMENTAL_CONST_INTERP` is used for testing I noticed a few regressions. We need to special-case the evaluating decl in a few places, since it's a global variable that we're allowed to modify.
…lvm#205805) It looks like there is still a bug with removing assumes from the assumption cache. Reverts llvm#205773
…re` and `gather/scatter` ops (llvm#204842) Extend negative stride checks to MaskedLoadOp, MaskedStoreOp, GatherOp, and ScatterOp to match LoadOp and StoreOp behavior. Depends on: llvm#204611. AI Disclaimer: I used AI for the tests. --------- Signed-off-by: Federico Bruzzone <federico.bruzzone.i@gmail.com>
…lvm#205518) The latency and throughput for these instructions don't match what's in the A510 Software Optimization Guide, so adjust them so that they do match. Also rearrange the definitions to match how they're structured in the optimization guide and rename things in a similar manner to how the C1 CPUs do things, as it's much clearer.
This fixes 5314be5. Signed-off-by: Ingo Müller <ingomueller@google.com> Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
Added the POSIX unsetenv() function and its internal support. Implemented EnvironmentManager::unset() to remove a variable by name, free the string if allocated, and compact the array. Updated EnvironmentManager to synchronize the public global environ pointer when transitioning to managed storage. Registered for x86_64, aarch64, and riscv. Integration tests cover basic operations and edge cases. Assisted-by: Automated tooling, human reviewed.
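The remove-and-compact step can be sketched on a plain null-terminated array of `name=value` strings. This is an illustration under assumptions, not the `EnvironmentManager` code: `remove_env_entry` is a hypothetical helper, and it leaves ownership and freeing of the removed strings to the caller.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: removes every "name=value" entry for `name` from a
// null-terminated array, shifting later entries down so no gap remains.
// Returns the number of entries removed; it does not free any memory.
std::size_t remove_env_entry(char **env, const char *name) {
  const std::size_t nameLen = std::strlen(name);
  std::size_t removed = 0;
  char **dst = env;
  for (char **src = env; *src != nullptr; ++src) {
    // The '=' check keeps "AB=3" from matching a request to remove "A".
    const bool matches = std::strncmp(*src, name, nameLen) == 0 &&
                         (*src)[nameLen] == '=';
    if (matches) {
      ++removed;
      continue;
    }
    *dst++ = *src;
  }
  *dst = nullptr; // re-terminate the compacted array
  return removed;
}
```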
It seems like using a non-`hidden` `toctree` for page navigation is a
bit of a trap, in that every doc must have a single unique path through
the global toctree to the root doc, and it is very easy to end up with
multiple.
This patch tries to address the warnings (actually infos, hence why it
does not fail the build) in llvm/docs/.
I tried to preserve the documents as-is, by hiding `toctree`s and
instead using lists of `{doc}` forms where the `toctree` was visible
before.
The only visual change in the resulting HTML is that the link is now
underlined where it wasn't before.
I also nested the `Tutorials` section in GISel Porting document, and
didn't link to it directly as the title is a bit ambiguous without the
context of the document it appears in.
I also saw warnings about a jump in heading level in
`llvm-debuginfo-analyzer/README.md` and assumed it was just a mistake,
so I collapsed the level-3 headings down to level-2.
Finally, I wrote a sphinx extension to make ambiguous toctree entries
into errors, so the docs do not regress. I hope to fix other sphinx
projects in llvm-project and enable the checks for them too, assuming
this patch is accepted.
Change-Id: Icb11de69be1ea5489fba501aee4d767f5129e7e1
SelectionDAG can fold a symbol address (a kernel parameter, global
variable, or external symbol) directly into a memory instruction's
address operand, but only within a single basic block. When the address
crosses a block boundary, ISel materializes it with `MOV_B{32,64}_sym`
and the memory instruction becomes register-relative:
```ptx
mov.b64 %rd1, kernel_param_0;
ld.param.b64 %rd2, [%rd1];
ld.param.b64 %rd3, [%rd1+8];
```
instead of:
```
ld.param.b64 %rd2, [kernel_param_0];
ld.param.b64 %rd3, [kernel_param_0+8];
```
This patch adds NVPTXAddressFolder, a pre-regalloc pass that looks for
loads and stores whose address operand is defined by `MOV_B{32,64}_sym`,
then folds the symbol back into the memory operand. The mov is erased
once it has no remaining uses; if the address also feeds arithmetic or
escapes, it is kept.
To make this generic over NVPTX memory instructions, the patch enables
named operand tables for NVPTX instructions and uses them to find `addr`
and `addsp` operands instead of hardcoding opcode-specific operand
indices.
This was motivated by a CUDA.jl performance regression where `byval`
kernel parameters stopped being pre-lowered into simple loads and
exposed the missing cross-block fold in the backend.
Disclaimer: LLMs (GPT 5.5, Opus 4.8) were used to develop this PR.
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
llvm#205806) `llvm_anyptr_ty` should only allow scalar pointer types and disallow vector of pointers; fix the vector constraint for `llvm_anyptr_ty` accordingly. This fixes a regression in `llvm_anyptr_ty` that was introduced in llvm#203506. Added a unit test to verify that use of a vector of pointers for `llvm_anyptr_ty` fails verification.
…load/store` and `gather/scatter` ops" (llvm#205832) Reverts llvm#204842. That PR breaks the following two tests: * `mlir/test/Integration/Dialect/SparseTensor/CPU/reshape_dot.mlir.test` * `mlir/test/Integration/Dialect/SparseTensor/CPU/sparse_coo_test.mlir.test`
…yReassocNodes (llvm#205578) Fixes llvm#205197
This fixes f36745e. Co-authored-by: Google Bazel Bot <google-bazel-bot@google.com>
llvm#205726) Example:
```fortran
subroutine sub(a, n)
  real(8) :: a(n, *)
  integer :: n, i
  !$acc data no_create(a)
  !$acc parallel loop
  do i = 1, n
    a(1, i) = 0.0d0
  end do
  !$acc end data
end subroutine
```
An assumed-size dummy array (e.g. `real(8) :: a(n,*)`) has an unknown trailing extent and is passed without a descriptor. When the OpenACC implicit-data pass builds a data clause for such an array, `generateSeqTyAccBounds` enters the unknown-shape branch but finds no descriptor (`fir.box`) to recover bounds from, and hits:

    assert(false && "array with unknown dimension expected to have descriptor");

The premise is wrong: an assumed-size array legitimately has no descriptor and no recoverable bounds.

Fix: return empty bounds instead of asserting. The caller only assigns bounds when non-empty, so the array is mapped without bounds. That is the only correct option when the extent is unknown, and it is sufficient for presence-only clauses (`no_create`/`present`).
…205727) Example:
```fortran
!$acc routine worker
subroutine transform(p, n)
  real*8 p(*); integer n
  !$acc loop seq
  do i1 = 1, n
    !$acc loop seq
    do i2 = 1, n
      ! ... a dozen+ levels of nested acc loops ...
      !$acc loop vector
      do i = 1, n
        p(i) = p(i) + 1.0d0
      end do
    end do
  end do
end subroutine
```
In this code, the routine becomes a `func.func` with deeply nested orphan `acc.loop` ops. `acc-specialize-for-host` lowers them (e.g. `acc.loop` → `scf.for`) via `applyPatternsGreedily`, but leaves `GreedyRewriteConfig::maxIterations` at its default of 10. Since inner loops only become rewritable after their parents convert, a nest deeper than 10 isn't at a fixed point when the cap is hit, so the driver returns `failure()` and the pass calls `signalPassFailure()`. The result is a spurious, diagnostic-less failure even though the conversion was progressing correctly.

Fix: Run the rewrite to convergence instead of stopping at the default cap:
```cpp
config.setMaxIterations(GreedyRewriteConfig::kNoLimit);
```
The patterns are strictly reductive (ops are lowered/erased, never regenerated), so this is safe.

Adds a regression test with a 16-deep orphan `acc.loop` nest.
…ching (llvm#205802) The previous code resolved a `construct={...}` selector by mapping its name to a trait through a string lookup (`getOpenMPContextTraitSelectorKind` -> `getOpenMPContextTraitPropertyForSelector`). That works for a standalone leaf construct such as `parallel`, whose name matches a single construct selector string, but not for a combined/composite directive such as `target teams`: its name (`"target teams"`) matches no selector string, so the lookup returns `invalid` and the selector's construct traits are dropped. This PR adds `AppendConstructTraitsForDirective` that walks the directive sets and adds each leaf trait, so combined directives are no longer reduced to a single (or dropped) trait. It also handles the standalone `dispatch` construct trait. Fixes llvm#205664 Assisted-by: Cursor
emitNewArrayInitializer hit errorNYI for a value-initialized array new of a trivially-constructible element (new T[n]()), where the trailing parentheses mean zero-initialization. Route the trivial-ctor branch through the existing tryMemsetInitialization() helper, following classic CodeGen: a zero-initializable element gets operator new[] plus a single memset to 0, while a non-zero-initializable element (an array of pointers-to-data-member, whose null value is -1) declines the memset and falls through to the constructor loop value-initializing each element. Also add the getParent()->isEmpty() early return. tryMemsetInitialization builds the memset through the Address overload of createMemSet so the destination alignment is preserved. Found building the SPEC CPU 2026 LLVM benchmark (723.llvm_r / 823.llvm_s) with ClangIR.
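The source-level behaviour being lowered can be observed in plain C++. The sketch below only checks the language semantics of `new T[n]()`; it says nothing about how ClangIR emits it, and the helper names are illustrative.

```cpp
#include <cstddef>

// `new int[n]()` value-initializes the array, so every element is zero.
bool value_initialized_ints_are_zero(std::size_t n) {
  int *p = new int[n]();
  bool allZero = true;
  for (std::size_t i = 0; i < n; ++i)
    allZero = allZero && (p[i] == 0);
  delete[] p;
  return allZero;
}

struct S {
  int x;
};
using MemPtr = int S::*;

// Value-initialized pointers-to-data-member are null member pointers. Their
// bit pattern is ABI-specific and need not be all-zero, which is why a plain
// memset to 0 is not a valid way to initialize them in general.
bool value_initialized_member_ptrs_are_null(std::size_t n) {
  MemPtr *p = new MemPtr[n]();
  bool allNull = true;
  for (std::size_t i = 0; i < n; ++i)
    allNull = allNull && (p[i] == nullptr);
  delete[] p;
  return allNull;
}
```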
Read the validated `driver-tools` build setting directly in `generate_driver_tools_def` instead of reconstructing tool names from selected `CcInfo` dependency labels through `label.name`. `generate_driver_selects` now returns the selected dependency labels, so `select_driver_tools` is removed. This fixes a silent issue when 2 multicall participants have the same `label.name`.
…unit tests (llvm#205449) Moves isStdInitializerList from misc/ExplicitConstructorCheck into utils::type_traits so it can be shared, and adds unit tests for it. NFC.
…alization (llvm#201967) Address a bug pointed out by @bjope (thank you!) - To perform struct to vector canonicalization, it is not enough that the struct layout size is the same as the vector layout size, because structs and vectors may have padding in different locations! Previously we would promote `{ i5, i5 }` as `<i5, i5>`, which is a miscompile! - I also relaxed another requirement. Previously we made sure that the struct layout size is equal to the vector allocation size. This prevented promoting `{ i32, i32, i32 }` as `<i32, i32, i32>` because the struct layout size is 3 x i32 but the vector allocation size is 4 x i32. So instead we should compare to the vector store size, which is 3 x i32.
Document the built-in compatibility for different GCC versions since GCC-5
…apture under `if` (llvm#205731) Example:
```fortran
!$acc parallel if(cond)
!$acc atomic capture
a = a + 1
b = a
!$acc end atomic
!$acc end parallel
```
In this code, the `if` clause triggers host-fallback specialization. A blanket `acc.terminator` erase pattern strips the implicit terminator of the still-present `acc.atomic.capture`, so the later `getTerminator()` trips `mightHaveTerminator()`.

Fix: remove that pattern. Every ACC region op already erases its own terminator when it unwraps/inlines, so it was redundant (and only raced ahead to break this case).

Adds a lit test.
…lvm#205738) Add pre-commit tests for scalar OR/UMax reductions whose result is only used by an eq/ne-zero comparison.
Check for the evaluating decl in `GetRefGlobal` so we don't fail too early. We also need to mark the `APValue` as constexpr-unknown when returning it, even though the backing `Descriptor` is not marked constexpr-unknown. This fixes the last differences in `constant-expression-p2280r4.cpp`.
The pass emits remarks describing `acc.firstprivate` and `acc.private` associated with OpenACC compute and loop constructs. Assisted-by: Claude Code
…#205701) When a DW_TAG_subprogram has a DW_AT_linkage_name that is not actually a mangled name and differs from DW_AT_name, lldb used the linkage name as the function's display name. For C++ the linkage name demangles back to the source name, but a plain symbol such as __main_argc_argv does not, so the function showed up under its raw linkage name in backtraces, breakpoint locations and `image lookup`, and was not findable by its source name. This happens on WebAssembly: wasi-libc renames `int main(int, char**)` to its __main_argc_argv argv-passing wrapper, keeping DW_AT_name "main" but recording DW_AT_linkage_name "__main_argc_argv". When the linkage name has no recognized mangling scheme, use DW_AT_name as the display name and keep the linkage name as the symbol, so lookups by either name still resolve.
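The selection rule in the last sentence can be sketched as below. This is a simplified illustration, not lldb's code: `has_recognized_mangling` is a hypothetical check that only knows the Itanium `_Z` prefix, whereas a real debugger recognizes several mangling schemes and would go on to demangle the name.

```cpp
#include <string>

// Hypothetical check: treats only the Itanium C++ "_Z" prefix as a
// recognized mangling scheme.
bool has_recognized_mangling(const std::string &name) {
  return name.rfind("_Z", 0) == 0;
}

// Picks the name to display for a function. A linkage name that is not
// actually mangled (e.g. "__main_argc_argv") is not used for display;
// the source-level DW_AT_name is used instead.
std::string display_name(const std::string &dwName,
                         const std::string &linkageName) {
  if (linkageName.empty() || !has_recognized_mangling(linkageName))
    return dwName;
  return linkageName; // a real implementation would demangle this
}
```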
dpalermo
approved these changes
Jun 25, 2026
__ocml_fma_f16 __ocml_exp_f16 __ocml_exp2_f16 __ocml_exp10_f1 __ocml_log2_f16 __ocml_log_f16 __ocml_log10_f16 __ocml_sqrt_f16
No description provided.