
[Clang][Sema] Avoid duplicate diagnostics for incomplete types in nested name specifier (C++20+) #147036


Open
wants to merge 4 commits into
base: main

Conversation


@zhy-tju zhy-tju commented Jul 4, 2025

Linked issue #147000
Clang currently emits duplicate diagnostics when encountering an incomplete
type in a nested name specifier (e.g., incomplete::type) in C++20 or later.
This is due to multiple semantic analysis paths (such as scope resolution
and qualified type building) triggering the same diagnostic.

This patch suppresses duplicate errors by recording diagnosed TagDecls
in a DenseSet within Sema (IncompleteDiagSet). If a TagDecl has already
triggered a diagnostic for being incomplete in a nested name specifier, it
will be skipped on subsequent checks.


github-actions bot commented Jul 4, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" the PR: add a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot llvmbot added clang Clang issues not falling into any other category backend:AMDGPU clang:frontend Language frontend issues, e.g. anything involving "Sema" mlir:gpu mlir mlir:amdgpu labels Jul 4, 2025
@llvmbot
Member

llvmbot commented Jul 4, 2025

@llvm/pr-subscribers-mlir
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: None (zhy-tju)

Changes

Linked issue #147000
Clang currently emits duplicate diagnostics when encountering an incomplete
type in a nested name specifier (e.g., incomplete::type) in C++20 or later.
This is due to multiple semantic analysis paths (such as scope resolution
and qualified type building) triggering the same diagnostic.

This patch suppresses duplicate errors by recording diagnosed TagDecls
in a DenseSet within Sema (IncompleteDiagSet). If a TagDecl has already
triggered a diagnostic for being incomplete in a nested name specifier, it
will be skipped on subsequent checks.


Full diff: https://github.com/llvm/llvm-project/pull/147036.diff

4 Files Affected:

  • (modified) clang/include/clang/Sema/Sema.h (+6)
  • (modified) clang/lib/Sema/SemaCXXScopeSpec.cpp (+9)
  • (modified) clang/test/SemaCXX/nested-name-spec.cpp (+7)
  • (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+110-1)
diff --git a/clang/include/clang/Sema/Sema.h b/clang/include/clang/Sema/Sema.h
index 3fe26f950ad51..1c7a67d32cf72 100644
--- a/clang/include/clang/Sema/Sema.h
+++ b/clang/include/clang/Sema/Sema.h
@@ -1555,6 +1555,12 @@ class Sema final : public SemaBase {
   Sema(const Sema &) = delete;
   void operator=(const Sema &) = delete;
 
+  /// Used to suppress duplicate diagnostics for incomplete types
+  /// in nested name specifiers (e.g. `incomplete::type`).
+  /// Without this, Clang may emit the same error multiple times
+  /// in C++20 or later, due to multiple semantic passes over the scope.
+  llvm::DenseSet<const TagDecl *> IncompleteDiagSet;
+
   /// The handler for the FileChanged preprocessor events.
   ///
   /// Used for diagnostics that implement custom semantic analysis for #include
diff --git a/clang/lib/Sema/SemaCXXScopeSpec.cpp b/clang/lib/Sema/SemaCXXScopeSpec.cpp
index ab83f625d2849..8731f3cbbb8cd 100644
--- a/clang/lib/Sema/SemaCXXScopeSpec.cpp
+++ b/clang/lib/Sema/SemaCXXScopeSpec.cpp
@@ -206,13 +206,22 @@ bool Sema::RequireCompleteDeclContext(CXXScopeSpec &SS,
   if (tag->isBeingDefined())
     return false;
 
+  // Avoid emitting duplicate diagnostics for the same tag.
+  // This happens in C++20+ due to more aggressive semantic analysis.
+  if (IncompleteDiagSet.contains(tag))
+    return true;
+
   SourceLocation loc = SS.getLastQualifierNameLoc();
   if (loc.isInvalid()) loc = SS.getRange().getBegin();
 
   // The type must be complete.
   if (RequireCompleteType(loc, type, diag::err_incomplete_nested_name_spec,
                           SS.getRange())) {
+    // mark as diagnosed
+    IncompleteDiagSet.insert(tag); 
+
     SS.SetInvalid(SS.getRange());
+
     return true;
   }
 
diff --git a/clang/test/SemaCXX/nested-name-spec.cpp b/clang/test/SemaCXX/nested-name-spec.cpp
index abeaba9d8dde2..df82d7a8dcf70 100644
--- a/clang/test/SemaCXX/nested-name-spec.cpp
+++ b/clang/test/SemaCXX/nested-name-spec.cpp
@@ -1,3 +1,10 @@
+// RUN: %clang_cc1 -std=c++20 -fsyntax-only -verify %s
+
+struct incomplete;
+incomplete::type var; // expected-error{{incomplete type 'incomplete' named in nested name specifier}}
+// expected-note@-2{{forward declaration of 'incomplete'}}
+
+
 // RUN: %clang_cc1 -fsyntax-only -std=c++98 -verify -fblocks %s
 namespace A {
   struct C {
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index eadb5d9326798..8ac73322c5513 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -106,6 +106,16 @@ def AMDGPU_ExtPackedFp8Op :
     If the passed-in vector has fewer than four elements, or the input is scalar,
     the remaining values in the <4 x i8> will be filled with
     undefined values as needed.
+
+    #### Example
+    ```mlir
+    // Extract single FP8 element to scalar f32
+    %element = amdgpu.ext_packed_fp8 %src_vector[0] : vector<4xf8E4M3FNUZ> to f32
+
+    // Extract two FP8 elements to vector<2xf32>
+    %elements = amdgpu.ext_packed_fp8 %src_vector[0] : vector<4xf8E4M3FNUZ> to vector<2xf32>
+    ```
+
   }];
   let assemblyFormat = [{
     attr-dict $source `[` $index `]` `:` type($source) `to` type($res)
@@ -162,6 +172,12 @@ def AMDGPU_PackedTrunc2xFp8Op :
     sub-registers, and so the conversion intrinsics (which are currently the
     only way to work with 8-bit float types) take packed vectors of 4 8-bit
     values.
+
+    #### Example
+    ```mlir
+    %result = amdgpu.packed_trunc_2xfp8 %src1, %src2 into %dest[word 1] 
+  : f32 to vector<4xf8E5M2FNUZ> into vector<4xf8E5M2FNUZ>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $sourceA `,` ($sourceB^):(`undef`)?
@@ -220,6 +236,12 @@ def AMDGPU_PackedStochRoundFp8Op :
     sub-registers, and so the conversion intrinsics (which are currently the
     only way to work with 8-bit float types) take packed vectors of 4 8-bit
     values.
+
+    #### Example
+    ```mlir
+   %result = amdgpu.packed_stoch_round_fp8 %src + %stoch_seed into %dest[2] 
+  : f32 to vector<4xf8E5M2FNUZ> into vector<4xf8E5M2FNUZ>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $source `+` $stochiasticParam
@@ -275,6 +297,18 @@ def AMDGPU_FatRawBufferCastOp :
     If the value of the memref's offset is not uniform (independent of the lane/thread ID),
     this will lead to substantially decreased performance due to the need for
     a waterfall loop on the base address of the buffer resource.
+
+   #### Example
+   ```mlir
+  // Simple cast
+%converted = amdgpu.fat_raw_buffer_cast %src 
+  : memref<8xi32> to memref<8xi32, #amdgpu.address_space<fat_raw_buffer>>
+// Cast with memory attributes
+%converted = amdgpu.fat_raw_buffer_cast %src validBytes(%valid) 
+  cacheSwizzleStride(%swizzle) boundsCheck(false) resetOffset
+  : memref<8xi32, strided<[1], offset: ?>> 
+    to memref<8xi32, strided<[1]>, #amdgpu.address_space<fat_raw_buffer>>
+   ```
   }];
 
   let extraClassDeclaration = [{
@@ -333,6 +367,17 @@ def AMDGPU_RawBufferLoadOp :
     - If `boundsCheck` is false and the target chipset is RDNA, OOB_SELECT is set
       to 2 to disable bounds checks, otherwise it is 3
     - The cache coherency bits are off
+
+    #### Example
+    ```mlir
+    // Load scalar f32 from 1D buffer
+    %scalar = amdgpu.raw_buffer_load %src[%idx] : memref<128xf32>, i32 -> f32
+    // Load vector<4xf32> from 4D buffer
+    %vector = amdgpu.raw_buffer_load %src[%idx0, %idx1, %idx2, %idx3] 
+    : memref<128x64x32x16xf32>, i32, i32, i32, i32 -> vector<4xf32>
+    // Load from scalar buffer
+    %value = amdgpu.raw_buffer_load %src[] : memref<f32> -> f32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $memref `[` $indices `]`
@@ -372,6 +417,17 @@ def AMDGPU_RawBufferStoreOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Store scalar f32 to 1D buffer
+    amdgpu.raw_buffer_store %value -> %dst[%idx] : f32 -> memref<128xf32>, i32
+    // Store vector<4xf32> to 4D buffer
+    amdgpu.raw_buffer_store %vec -> %dst[%idx0, %idx1, %idx2, %idx3] 
+    : vector<4xf32> -> memref<128x64x32x16xf32>, i32, i32, i32, i32
+    // Store to scalar buffer
+    amdgpu.raw_buffer_store %value -> %dst[] : f32 -> memref<f32>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $value `->` $memref `[` $indices `]`
@@ -414,6 +470,13 @@ def AMDGPU_RawBufferAtomicCmpswapOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Atomic compare-swap
+    amdgpu.raw_buffer_atomic_cmpswap %src, %cmp -> %dst[%idx] 
+    : f32 -> memref<128xf32>, i32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
@@ -453,6 +516,13 @@ def AMDGPU_RawBufferAtomicFaddOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Atomic floating-point add
+    amdgpu.raw_buffer_atomic_fadd %value -> %dst[%idx] 
+    : f32 -> memref<128xf32>, i32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $value `->` $memref `[` $indices `]`
@@ -647,11 +717,16 @@ def AMDGPU_SwizzleBitModeOp : AMDGPU_Op<"swizzle_bitmode",
 
     Supports arbitrary int/float/vector types, which will be repacked to i32 and
     one or more `rocdl.ds_swizzle` ops during lowering.
+
+    #### Example
+    ```mlir
+ %result = amdgpu.swizzle_bitmode %src 1 2 4 : f32
+    ```
   }];
   let results = (outs AnyIntegerOrFloatOr1DVector:$result);
   let assemblyFormat = [{
     $src $and_mask $or_mask $xor_mask attr-dict `:` type($result)
-  }];
+    }];
 }
 
 def AMDGPU_LDSBarrierOp : AMDGPU_Op<"lds_barrier"> {
@@ -673,6 +748,11 @@ def AMDGPU_LDSBarrierOp : AMDGPU_Op<"lds_barrier"> {
     (those which will implement this barrier by emitting inline assembly),
     use of this operation will impede the usabiliity of memory watches (including
     breakpoints set on variables) when debugging.
+
+    #### Example
+    ```mlir
+  amdgpu.lds_barrier
+    ```
   }];
   let assemblyFormat = "attr-dict";
 }
@@ -711,6 +791,14 @@ def AMDGPU_SchedBarrierOp :
     `amdgpu.sched_barrier` serves as a barrier that could be
     configured to restrict movements of instructions through it as
     defined by sched_barrier_opts.
+
+    #### Example
+    ```mlir
+    // Barrier allowing no dependent instructions
+    amdgpu.sched_barrier allow = <none>
+    // Barrier allowing specific execution units
+    amdgpu.sched_barrier allow = <valu|all_vmem>
+    ```
   }];
   let assemblyFormat = [{
     `allow` `=` $opts attr-dict
@@ -810,6 +898,13 @@ def AMDGPU_MFMAOp :
 
     The negateA, negateB, and negateC flags are only supported for double-precision
     operations on gfx94x.
+
+    #### Example
+    ```mlir
+  %result = amdgpu.mfma %a * %b + %c 
+  { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 } 
+  : f32, f32, vector<32xf32>
+    ```
   }];
   let assemblyFormat = [{
     $sourceA `*` $sourceB `+` $destC
@@ -851,6 +946,12 @@ def AMDGPU_WMMAOp :
 
     The `clamp` flag is used to saturate the output of type T to numeric_limits<T>::max()
     in case of overflow.
+
+    #### Example
+    ```mlir
+  %result = amdgpu.wmma %a * %b + %c 
+  : vector<16xf16>, vector<16xf16>, vector<8xf16>
+    ```
   }];
   let assemblyFormat = [{
     $sourceA `*` $sourceB `+` $destC
@@ -973,6 +1074,14 @@ def AMDGPU_ScaledMFMAOp :
     are omitted from this wrapper.
     - The `negateA`, `negateB`, and `negateC` flags in `amdgpu.mfma` are only supported for 
     double-precision operations on gfx94x and so are not included here. 
+
+    #### Example
+    ```mlir
+ %result = amdgpu.scaled_mfma 
+  (%scale_a[0] * %vec_a) * (%scale_b[1] * %vec_b) + %accum
+  { k = 64 : i32, m = 32 : i32, n = 32 : i32 } 
+  : f8E8M0FNU, vector<32xf6E2M3FN>, f8E8M0FNU, vector<32xf6E2M3FN>, vector<16xf32>
+    ```
   }];
   let assemblyFormat = [{
     `(` $scalesA `[` $scalesIdxA `]` `*` $sourceA `)` `*` `(` $scalesB `[` $scalesIdxB `]` `*` $sourceB `)` `+` $destC

@llvmbot
Member

llvmbot commented Jul 4, 2025

@llvm/pr-subscribers-mlir-amdgpu

(This comment repeats the PR description and full diff shown above.)
@llvmbot
Member

llvmbot commented Jul 4, 2025

@llvm/pr-subscribers-mlir-gpu

Author: None (zhy-tju)

Changes

Linked issue #147000
Clang currently emits duplicate diagnostics when encountering an incomplete
type in a nested name specifier (e.g., incomplete::type) in C++20 or later.
This is due to multiple semantic analysis paths (such as scope resolution
and qualified type building) triggering the same diagnostic.

This patch suppresses duplicate errors by recording diagnosed TagDecls
in a DenseSet within Sema (IncompleteDiagSet). If a TagDecl has already
triggered a diagnostic for being incomplete in a nested name specifier, it
will be skipped on subsequent checks.


Full diff: https://github.com/llvm/llvm-project/pull/147036.diff

4 Files Affected:

  • (modified) clang/include/clang/Sema/Sema.h (+6)
  • (modified) clang/lib/Sema/SemaCXXScopeSpec.cpp (+9)
  • (modified) clang/test/SemaCXX/nested-name-spec.cpp (+7)
  • (modified) mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td (+110-1)
diff --git a/clang/include/clang/Sema/Sema.h b/clang/include/clang/Sema/Sema.h
index 3fe26f950ad51..1c7a67d32cf72 100644
--- a/clang/include/clang/Sema/Sema.h
+++ b/clang/include/clang/Sema/Sema.h
@@ -1555,6 +1555,12 @@ class Sema final : public SemaBase {
   Sema(const Sema &) = delete;
   void operator=(const Sema &) = delete;
 
+  /// Used to suppress duplicate diagnostics for incomplete types
+  /// in nested name specifiers (e.g. `incomplete::type`).
+  /// Without this, Clang may emit the same error multiple times
+  /// in C++20 or later, due to multiple semantic passes over the scope.
+  llvm::DenseSet<const TagDecl *> IncompleteDiagSet;
+
   /// The handler for the FileChanged preprocessor events.
   ///
   /// Used for diagnostics that implement custom semantic analysis for #include
diff --git a/clang/lib/Sema/SemaCXXScopeSpec.cpp b/clang/lib/Sema/SemaCXXScopeSpec.cpp
index ab83f625d2849..8731f3cbbb8cd 100644
--- a/clang/lib/Sema/SemaCXXScopeSpec.cpp
+++ b/clang/lib/Sema/SemaCXXScopeSpec.cpp
@@ -206,13 +206,22 @@ bool Sema::RequireCompleteDeclContext(CXXScopeSpec &SS,
   if (tag->isBeingDefined())
     return false;
 
+  // Avoid emitting duplicate diagnostics for the same tag.
+  // This happens in C++20+ due to more aggressive semantic analysis.
+  if (IncompleteDiagSet.contains(tag))
+    return true;
+
   SourceLocation loc = SS.getLastQualifierNameLoc();
   if (loc.isInvalid()) loc = SS.getRange().getBegin();
 
   // The type must be complete.
   if (RequireCompleteType(loc, type, diag::err_incomplete_nested_name_spec,
                           SS.getRange())) {
+    // mark as diagnosed
+    IncompleteDiagSet.insert(tag); 
+
     SS.SetInvalid(SS.getRange());
+
     return true;
   }
 
diff --git a/clang/test/SemaCXX/nested-name-spec.cpp b/clang/test/SemaCXX/nested-name-spec.cpp
index abeaba9d8dde2..df82d7a8dcf70 100644
--- a/clang/test/SemaCXX/nested-name-spec.cpp
+++ b/clang/test/SemaCXX/nested-name-spec.cpp
@@ -1,3 +1,10 @@
+// RUN: %clang_cc1 -std=c++20 -fsyntax-only -verify %s
+
+struct incomplete;
+incomplete::type var; // expected-error{{incomplete type 'incomplete' named in nested name specifier}}
+// expected-note@-2{{forward declaration of 'incomplete'}}
+
+
 // RUN: %clang_cc1 -fsyntax-only -std=c++98 -verify -fblocks %s
 namespace A {
   struct C {
diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
index eadb5d9326798..8ac73322c5513 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPU.td
@@ -106,6 +106,16 @@ def AMDGPU_ExtPackedFp8Op :
     If the passed-in vector has fewer than four elements, or the input is scalar,
     the remaining values in the <4 x i8> will be filled with
     undefined values as needed.
+
+    #### Example
+    ```mlir
+    // Extract single FP8 element to scalar f32
+    %element = amdgpu.ext_packed_fp8 %src_vector[0] : vector<4xf8E4M3FNUZ> to f32
+
+    // Extract two FP8 elements to vector<2xf32>
+    %elements = amdgpu.ext_packed_fp8 %src_vector[0] : vector<4xf8E4M3FNUZ> to vector<2xf32>
+    ```
+
   }];
   let assemblyFormat = [{
     attr-dict $source `[` $index `]` `:` type($source) `to` type($res)
@@ -162,6 +172,12 @@ def AMDGPU_PackedTrunc2xFp8Op :
     sub-registers, and so the conversion intrinsics (which are currently the
     only way to work with 8-bit float types) take packed vectors of 4 8-bit
     values.
+
+    #### Example
+    ```mlir
+    %result = amdgpu.packed_trunc_2xfp8 %src1, %src2 into %dest[word 1] 
+  : f32 to vector<4xf8E5M2FNUZ> into vector<4xf8E5M2FNUZ>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $sourceA `,` ($sourceB^):(`undef`)?
@@ -220,6 +236,12 @@ def AMDGPU_PackedStochRoundFp8Op :
     sub-registers, and so the conversion intrinsics (which are currently the
     only way to work with 8-bit float types) take packed vectors of 4 8-bit
     values.
+
+    #### Example
+    ```mlir
+   %result = amdgpu.packed_stoch_round_fp8 %src + %stoch_seed into %dest[2] 
+  : f32 to vector<4xf8E5M2FNUZ> into vector<4xf8E5M2FNUZ>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $source `+` $stochiasticParam
@@ -275,6 +297,18 @@ def AMDGPU_FatRawBufferCastOp :
     If the value of the memref's offset is not uniform (independent of the lane/thread ID),
     this will lead to substantially decreased performance due to the need for
     a waterfall loop on the base address of the buffer resource.
+
+   #### Example
+   ```mlir
+  // Simple cast
+%converted = amdgpu.fat_raw_buffer_cast %src 
+  : memref<8xi32> to memref<8xi32, #amdgpu.address_space<fat_raw_buffer>>
+// Cast with memory attributes
+%converted = amdgpu.fat_raw_buffer_cast %src validBytes(%valid) 
+  cacheSwizzleStride(%swizzle) boundsCheck(false) resetOffset
+  : memref<8xi32, strided<[1], offset: ?>> 
+    to memref<8xi32, strided<[1]>, #amdgpu.address_space<fat_raw_buffer>>
+   ```
   }];
 
   let extraClassDeclaration = [{
@@ -333,6 +367,17 @@ def AMDGPU_RawBufferLoadOp :
     - If `boundsCheck` is false and the target chipset is RDNA, OOB_SELECT is set
       to 2 to disable bounds checks, otherwise it is 3
     - The cache coherency bits are off
+
+    #### Example
+    ```mlir
+    // Load scalar f32 from 1D buffer
+    %scalar = amdgpu.raw_buffer_load %src[%idx] : memref<128xf32>, i32 -> f32
+    // Load vector<4xf32> from 4D buffer
+    %vector = amdgpu.raw_buffer_load %src[%idx0, %idx1, %idx2, %idx3] 
+    : memref<128x64x32x16xf32>, i32, i32, i32, i32 -> vector<4xf32>
+    // Load from scalar buffer
+    %value = amdgpu.raw_buffer_load %src[] : memref<f32> -> f32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $memref `[` $indices `]`
@@ -372,6 +417,17 @@ def AMDGPU_RawBufferStoreOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Store scalar f32 to 1D buffer
+    amdgpu.raw_buffer_store %value -> %dst[%idx] : f32 -> memref<128xf32>, i32
+    // Store vector<4xf32> to 4D buffer
+    amdgpu.raw_buffer_store %vec -> %dst[%idx0, %idx1, %idx2, %idx3] 
+    : vector<4xf32> -> memref<128x64x32x16xf32>, i32, i32, i32, i32
+    // Store to scalar buffer
+    amdgpu.raw_buffer_store %value -> %dst[] : f32 -> memref<f32>
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $value `->` $memref `[` $indices `]`
@@ -414,6 +470,13 @@ def AMDGPU_RawBufferAtomicCmpswapOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Atomic compare-swap
+    amdgpu.raw_buffer_atomic_cmpswap %src, %cmp -> %dst[%idx]
+      : f32 -> memref<128xf32>, i32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $src `,` $cmp `->` $memref `[` $indices `]`
@@ -453,6 +516,13 @@ def AMDGPU_RawBufferAtomicFaddOp :
 
     See `amdgpu.raw_buffer_load` for a description of how the underlying
     instruction is constructed.
+
+    #### Example
+    ```mlir
+    // Atomic floating-point add
+    amdgpu.raw_buffer_atomic_fadd %value -> %dst[%idx]
+      : f32 -> memref<128xf32>, i32
+    ```
   }];
   let assemblyFormat = [{
     attr-dict $value `->` $memref `[` $indices `]`
@@ -647,11 +717,16 @@ def AMDGPU_SwizzleBitModeOp : AMDGPU_Op<"swizzle_bitmode",
 
     Supports arbitrary int/float/vector types, which will be repacked to i32 and
     one or more `rocdl.ds_swizzle` ops during lowering.
+
+    #### Example
+    ```mlir
+    %result = amdgpu.swizzle_bitmode %src 1 2 4 : f32
+    ```
   }];
   let results = (outs AnyIntegerOrFloatOr1DVector:$result);
   let assemblyFormat = [{
     $src $and_mask $or_mask $xor_mask attr-dict `:` type($result)
  }];
 }
 
 def AMDGPU_LDSBarrierOp : AMDGPU_Op<"lds_barrier"> {
@@ -673,6 +748,11 @@ def AMDGPU_LDSBarrierOp : AMDGPU_Op<"lds_barrier"> {
     (those which will implement this barrier by emitting inline assembly),
    use of this operation will impede the usability of memory watches (including
     breakpoints set on variables) when debugging.
+
+    #### Example
+    ```mlir
+    amdgpu.lds_barrier
+    ```
   }];
   let assemblyFormat = "attr-dict";
 }
@@ -711,6 +791,14 @@ def AMDGPU_SchedBarrierOp :
     `amdgpu.sched_barrier` serves as a barrier that could be
     configured to restrict movements of instructions through it as
     defined by sched_barrier_opts.
+
+    #### Example
+    ```mlir
+    // Barrier allowing no dependent instructions
+    amdgpu.sched_barrier allow = <none>
+    // Barrier allowing specific execution units
+    amdgpu.sched_barrier allow = <valu|all_vmem>
+    ```
   }];
   let assemblyFormat = [{
     `allow` `=` $opts attr-dict
@@ -810,6 +898,13 @@ def AMDGPU_MFMAOp :
 
     The negateA, negateB, and negateC flags are only supported for double-precision
     operations on gfx94x.
+
+    #### Example
+    ```mlir
+    %result = amdgpu.mfma %a * %b + %c
+      { abid = 1 : i32, cbsz = 1 : i32, k = 1 : i32, m = 32 : i32, n = 32 : i32, blocks = 2 : i32 }
+      : f32, f32, vector<32xf32>
+    ```
   }];
   let assemblyFormat = [{
     $sourceA `*` $sourceB `+` $destC
@@ -851,6 +946,12 @@ def AMDGPU_WMMAOp :
 
     The `clamp` flag is used to saturate the output of type T to numeric_limits<T>::max()
     in case of overflow.
+
+    #### Example
+    ```mlir
+    %result = amdgpu.wmma %a * %b + %c
+      : vector<16xf16>, vector<16xf16>, vector<8xf16>
+    ```
   }];
   let assemblyFormat = [{
     $sourceA `*` $sourceB `+` $destC
@@ -973,6 +1074,14 @@ def AMDGPU_ScaledMFMAOp :
     are omitted from this wrapper.
     - The `negateA`, `negateB`, and `negateC` flags in `amdgpu.mfma` are only supported for 
     double-precision operations on gfx94x and so are not included here. 
+
+    #### Example
+    ```mlir
+    %result = amdgpu.scaled_mfma
+      (%scale_a[0] * %vec_a) * (%scale_b[1] * %vec_b) + %accum
+      { k = 64 : i32, m = 32 : i32, n = 32 : i32 }
+      : f8E8M0FNU, vector<32xf6E2M3FN>, f8E8M0FNU, vector<32xf6E2M3FN>, vector<16xf32>
+    ```
   }];
   let assemblyFormat = [{
     `(` $scalesA `[` $scalesIdxA `]` `*` $sourceA `)` `*` `(` $scalesB `[` $scalesIdxB `]` `*` $sourceB `)` `+` $destC

@zyn0217 zyn0217 left a comment


Why don't language modes below C++20 suffer from the issue?

I think it's wasteful to add a Sema-scope object just for diagnostic deduplication. It seems more likely that there's an underlying issue that would otherwise be hidden by this patch. Can you investigate?
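For context on the approach being questioned here: the patch summary describes recording already-diagnosed `TagDecl`s in a `DenseSet` (`IncompleteDiagSet`) and skipping repeat diagnostics. A minimal standalone sketch of that deduplication idea follows — this is not the actual Sema code; `std::unordered_set` and the stub `TagDecl` are stand-ins for `llvm::DenseSet` and Clang's real declaration class:

```cpp
#include <cstdio>
#include <unordered_set>

// Stand-in for clang::TagDecl; any stable pointer identity works as a key.
struct TagDecl {
  const char *name;
};

// Sketch of the patch's IncompleteDiagSet idea: remember which decls have
// already produced an "incomplete type" diagnostic and suppress repeats.
struct DiagDeduper {
  std::unordered_set<const TagDecl *> diagnosed;

  // Returns true if a diagnostic was actually emitted.
  bool diagnoseIncomplete(const TagDecl *d) {
    if (!diagnosed.insert(d).second)
      return false; // Already diagnosed once; skip the duplicate.
    std::printf("error: incomplete type '%s' in nested name specifier\n",
                d->name);
    return true;
  }
};
```

Two semantic paths (scope resolution and qualified-type building) that both reach `diagnoseIncomplete` for the same decl would then emit only one error. Whether suppressing the second report is the right fix, as opposed to removing the duplicated path, is exactly the question raised in review.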

mizvekov commented Jul 4, 2025

I have a work in progress patch which fixes this issue, and doesn't need to take this approach of storing diagnosed entities.

It's a big patch that doesn't target this issue specifically, but I remember I encountered this problem while refactoring things, and I removed the duplicated paths.

I don't remember specific details anymore, but here is the patch: 6b69e5a
And I plan on submitting a PR for it within the next couple of weeks.

Labels: backend:AMDGPU, clang:frontend, clang