[mlir][Vector] Fix mask unpacking in transfer op unrolling #144889

Groverkss
Member

The mask vector is computed before any permutation or broadcasting is applied to the memory space, so the outermost dimension of the vector may not correspond to the outermost dimension of the mask. Transpose the mask before extracting from it. The transpose eventually folds into the vector.extract once further unrolling takes place.
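
To make the dimension mismatch concrete, here is a minimal standalone C++ sketch (plain `std::vector`, no MLIR dependencies) of the index bookkeeping described above. The helper names `swapToOutermost` and `transposeShape` are hypothetical and exist only for illustration. The shape rule assumes `vector.transpose` semantics in which result dimension `k` is taken from source dimension `perm[k]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the idea of the patch: start from the
// identity permutation and swap `pos` (the mask dimension that matches
// the vector's outermost dimension) into position 0.
std::vector<int64_t> swapToOutermost(int64_t rank, int64_t pos) {
  assert(rank > 0 && pos >= 0 && pos < rank && "position out of range");
  std::vector<int64_t> perm(static_cast<std::size_t>(rank));
  std::iota(perm.begin(), perm.end(), int64_t{0});
  std::swap(perm[0], perm[static_cast<std::size_t>(pos)]);
  return perm;
}

// Shape of the transposed mask, assuming result dim k = source dim perm[k].
std::vector<int64_t> transposeShape(const std::vector<int64_t> &shape,
                                    const std::vector<int64_t> &perm) {
  assert(shape.size() == perm.size() && "rank mismatch");
  std::vector<int64_t> result(shape.size());
  for (std::size_t k = 0; k < perm.size(); ++k)
    result[k] = shape[static_cast<std::size_t>(perm[k])];
  return result;
}
```

For the test case in this PR, dropping the unused memory dimension from the permutation map should leave `(d0, d1, d2) -> (d1, d0, d2)`, so the first result sits at position 1. Under that reading, `swapToOutermost(3, 1)` gives `[1, 0, 2]`, and applying it to the `3x2x4` mask shape gives `2x3x4`, which lines up with the `vector<2x3x4xf32>` result.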

@llvmbot
Member

llvmbot commented Jun 19, 2025

@llvm/pr-subscribers-mlir

Author: Kunwar Grover (Groverkss)


Full diff: https://github.com/llvm/llvm-project/pull/144889.diff

2 Files Affected:

  • (modified) mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp (+13-2)
  • (modified) mlir/test/Conversion/VectorToSCF/unrolled-vector-to-loops.mlir (+30)
diff --git a/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp b/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp
index cc5623068ab10..189bf7f619888 100644
--- a/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp
+++ b/mlir/lib/Conversion/VectorToSCF/VectorToSCF.cpp
@@ -1208,11 +1208,22 @@ static void maybeAssignMask(OpBuilder &b, OpTy xferOp, OpTy newXferOp,
   if (xferOp.getMaskType().getRank() > 1) {
     // Unpack one dimension of the mask.
     OpBuilder::InsertionGuard guard(b);
+    Location loc = xferOp.getLoc();
     b.setInsertionPoint(newXferOp); // Insert load before newXfer.
 
+    auto expr = dyn_cast<AffineDimExpr>(
+        compressUnusedDims(xferOp.getPermutationMap()).getResult(0));
+    assert(expr && "cannot extract from dimension");
+    // Transpose dim to be the outer most dimension, so we can use
+    // vector.extract on it.
+    TypedValue<VectorType> mask = xferOp.getMask();
+    SmallVector<int64_t> perm =
+        llvm::to_vector(llvm::seq<int64_t>(mask.getType().getRank()));
+    std::swap(perm[0], perm[expr.getPosition()]);
+    mask = b.create<vector::TransposeOp>(loc, mask, perm);
+    // Extract from the transposed mask.
     llvm::SmallVector<int64_t, 1> indices({i});
-    Location loc = xferOp.getLoc();
-    auto newMask = b.create<vector::ExtractOp>(loc, xferOp.getMask(), indices);
+    auto newMask = b.create<vector::ExtractOp>(loc, mask, indices);
     newXferOp.getMaskMutable().assign(newMask);
   }
 
diff --git a/mlir/test/Conversion/VectorToSCF/unrolled-vector-to-loops.mlir b/mlir/test/Conversion/VectorToSCF/unrolled-vector-to-loops.mlir
index 7d97829c06599..8aa72086e4e0e 100644
--- a/mlir/test/Conversion/VectorToSCF/unrolled-vector-to-loops.mlir
+++ b/mlir/test/Conversion/VectorToSCF/unrolled-vector-to-loops.mlir
@@ -84,3 +84,33 @@ func.func @transfer_read_mask(%A : memref<?x?x?xf32>, %mask : vector<2x3x4xi1>)
   %vec = vector.transfer_read %A[%c0, %c0, %c0], %f0, %mask {in_bounds = [true, true, true]}: memref<?x?x?xf32>, vector<2x3x4xf32>
   return %vec : vector<2x3x4xf32>
 }
+
+// -----
+
+func.func @transfer_read_perm_mask(%A : memref<?x?x?x?xf32>, %mask : vector<3x2x4xi1>) -> (vector<2x3x4xf32>) {
+  %f0 = arith.constant 0.0: f32
+  %c0 = arith.constant 0: index
+
+  // CHECK:      vector.extract %{{.*}}[0, 0] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [0, 0] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NEXT: vector.extract %{{.*}}[1, 0] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [0, 1] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NEXT: vector.extract %{{.*}}[2, 0] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [0, 2] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NEXT: vector.extract %{{.*}}[0, 1] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [1, 0] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NEXT: vector.extract %{{.*}}[1, 1] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [1, 1] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NEXT: vector.extract %{{.*}}[2, 1] : vector<4xi1> from vector<3x2x4xi1>
+  // CHECK-NEXT: vector.transfer_read {{.*}} : memref<?x?x?x?xf32>, vector<4xf32>
+  // CHECK-NEXT: vector.insert {{.*}} [1, 2] : vector<4xf32> into vector<2x3x4xf32>
+  // CHECK-NOT: scf.if
+  // CHECK-NOT: scf.for
+  %vec = vector.transfer_read %A[%c0, %c0, %c0, %c0], %f0, %mask {permutation_map = affine_map<(d0, d1, d2, d4) -> (d2, d0, d4)>, in_bounds = [true, true, true]}: memref<?x?x?x?xf32>, vector<2x3x4xf32>
+  return %vec : vector<2x3x4xf32>
+}

%f0 = arith.constant 0.0: f32
%c0 = arith.constant 0: index

// CHECK: vector.extract %{{.*}}[0, 0] : vector<4xi1> from vector<3x2x4xi1>
Contributor

can we see a little more in the test (at least the transpose on the mask) ?

Member Author

I'm not sure that's possible when unrolling completely, because the transpose will just fold into the vector.extract on further unrolling. The vector.extract indices do show that the mask is being read in a transposed fashion. The other option is to have a test that doesn't unroll fully. Any ideas on which would be preferred?
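
As a rough illustration of why the extract indices alone reveal the transposed read, the following standalone C++ sketch models the index arithmetic of folding an extract into a transpose. It is a hypothetical model, not MLIR's actual folder: `foldExtractOfTranspose` is an invented name, and it assumes that result dimension `k` of the transpose comes from source dimension `perm[k]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: given transposed = transpose(src, perm), map the
// leading extract indices `idx` on `transposed` to the equivalent leading
// indices on `src`.
std::vector<int64_t> foldExtractOfTranspose(const std::vector<int64_t> &perm,
                                            const std::vector<int64_t> &idx) {
  const std::size_t n = idx.size();
  assert(n <= perm.size() && "too many extract indices");
  std::vector<int64_t> srcIdx(n, -1);
  for (std::size_t k = 0; k < n; ++k) {
    // The result is a plain leading-index extract only if the indexed
    // dimensions stay within the first n source dimensions.
    assert(perm[k] >= 0 && static_cast<std::size_t>(perm[k]) < n &&
           "does not fold to a leading-index extract");
    srcIdx[static_cast<std::size_t>(perm[k])] = idx[k];
  }
  return srcIdx;
}
```

With `perm = [1, 0, 2]`, the vector slice at `[0, 1]` reads the mask slice at `[1, 0]`. That matches the CHECK lines in the test, where `vector.extract ...[1, 0]` on the mask is paired with `vector.insert ... [0, 1]` into the result.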

Contributor

OK, I missed that the test was also doing that.
There should be a max-transfer-rank or similar parameter where one could stop the unrolling at 2-D transfer reads (e.g. for HW that supports > 1-D loads), but it is likely not worth the trouble at this point.

// CHECK-NEXT: vector.insert {{.*}} [1, 2] : vector<4xf32> into vector<2x3x4xf32>
// CHECK-NOT: scf.if
// CHECK-NOT: scf.for
%vec = vector.transfer_read %A[%c0, %c0, %c0, %c0], %f0, %mask {permutation_map = affine_map<(d0, d1, d2, d4) -> (d2, d0, d4)>, in_bounds = [true, true, true]}: memref<?x?x?x?xf32>, vector<2x3x4xf32>
Contributor

@nicolasvasilache nicolasvasilache Jun 19, 2025

Hmm, it is surprising not to have the mask type as part of the op, given that the mapping between vector<2x3x4xf32> and vector<3x2x4xi1> is not trivial.
@dcaballe should the parser/printer be improved? (in a future PR)

Contributor

Printing mask type would make sense to me. We discussed something similar recently:

However, there's a broader question. Do we need to support both forms:

%vec = vector.transfer_read %A[%c0, %c0, %c0, %c0], %f0, %mask

vs

%vec = vector.mask %mask { vector.transfer_read %A[%c0, %c0, %c0, %c0], %f0 }

?

Also, @Groverkss , this is a very nice example that demonstrates a case where the shape of the mask and the output vectors are different. We miss such examples in ops.mlir and I'd be tempted to add it there. Just as a nice-to-have.
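
For readers puzzling over why the mask here is `vector<3x2x4xi1>` while the result is `vector<2x3x4xf32>`, the standalone C++ sketch below models one way to derive the mask shape. It rests on the PR description's statement that the mask is defined on the memory space before the permutation is applied. The function `maskShapeFor` is a hypothetical illustration, not the MLIR API, and it only handles pure permutations (no broadcast dimensions).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical model: `results[k]` is the memory dimension that vector
// dimension k reads from (for example {2, 0, 3} for the map
// (d0, d1, d2, d3) -> (d2, d0, d3)). Mask dimensions follow
// memory-dimension order, so sort the vector extents by memory dimension.
std::vector<int64_t> maskShapeFor(const std::vector<int64_t> &vectorShape,
                                  const std::vector<int64_t> &results) {
  assert(vectorShape.size() == results.size() && "rank mismatch");
  std::vector<std::pair<int64_t, int64_t>> byMemDim;
  for (std::size_t k = 0; k < results.size(); ++k)
    byMemDim.emplace_back(results[k], vectorShape[k]);
  std::sort(byMemDim.begin(), byMemDim.end());
  std::vector<int64_t> maskShape;
  for (const auto &entry : byMemDim)
    maskShape.push_back(entry.second);
  return maskShape;
}
```

For the example in this thread, `maskShapeFor({2, 3, 4}, {2, 0, 3})` yields `{3, 2, 4}`: memory dimension 0 carries extent 3, dimension 2 carries extent 2, and dimension 3 carries extent 4.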

compressUnusedDims(xferOp.getPermutationMap()).getResult(0));
assert(expr && "cannot extract from dimension");
// Transpose dim to be the outer most dimension, so we can use
// vector.extract on it.
Contributor

I'd rephrase a bit:

vector.extract can only extract the most minor dimensions of a multi-dimensional vector.
Transpose `d0` to the most minor dimension so we can extract the (n-1)-D submask.

Contributor

@banach-space banach-space left a comment

Thanks! I'm not particularly familiar with this logic, but the test makes sense.

any permutations or broadcasting on the memory space

Is broadcasting relevant here? If yes, it would be good to add a test for that.

LGTM % minor suggestions (nice-to-haves aka nits)


// -----

func.func @transfer_read_perm_mask(%A : memref<?x?x?x?xf32>, %mask : vector<3x2x4xi1>) -> (vector<2x3x4xf32>) {
Contributor

IMO, this and other tests in this file are missing LIT variables that would demonstrate that e.g. %MASK_1 is used for %XFER_READ_1.
