[AArch64][SVE] Avoid movprfx by reusing register for _UNDEF pseudos. #166926
Conversation
For predicated SVE instructions where we know that the inactive lanes are undef, it is better to pick a destination register that is not unique. This avoids introducing a movprfx to copy a unique register to the destination operand, which would otherwise be needed to comply with the tied-operand constraints.

For example:

  %src1 = COPY $z1
  %src2 = COPY $z2
  %dst = SDIV_ZPZZ_S_UNDEF %p, %src1, %src2

Here it is beneficial to pick $z1 or $z2 as the destination register, because if the allocator had chosen a unique register (e.g. $z0), the pseudo-expand pass would need to insert a MOVPRFX to expand the operation into:

  $z0 = SDIV_ZPZZ_S_UNDEF $p0, $z1, $z2
->
  $z0 = MOVPRFX $z1
  $z0 = SDIV_ZPmZ_S $p0, $z0, $z2

By picking $z1 directly, we'd get:

  $z1 = SDIV_ZPmZ_S $p0, $z1, $z2
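For illustration only, here is a minimal standalone sketch (toy code, not the actual pseudo-expansion pass) of the rule the description relies on: the destructive form ties the destination to the first source, so a MOVPRFX is emitted exactly when the chosen destination differs from it.

#include <cstdio>
#include <string>

// Toy model of expanding a destructive-binary _UNDEF pseudo into real SVE
// instructions; register and opcode names follow the example above.
static void expandSdivUndef(const std::string &Dst, const std::string &Pg,
                            const std::string &Src1, const std::string &Src2) {
  // The destructive SDIV ties its destination to the first source operand,
  // so a copy is only needed when the two differ.
  if (Dst != Src1)
    std::printf("%s = MOVPRFX %s\n", Dst.c_str(), Src1.c_str());
  std::printf("%s = SDIV_ZPmZ_S %s, %s, %s\n", Dst.c_str(), Pg.c_str(),
              Dst.c_str(), Src2.c_str());
}

int main() {
  expandSdivUndef("$z0", "$p0", "$z1", "$z2"); // unique dest: needs MOVPRFX
  expandSdivUndef("$z1", "$p0", "$z1", "$z2"); // reused dest: no MOVPRFX
  return 0;
}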
@llvm/pr-subscribers-backend-aarch64

Author: Sander de Smalen (sdesmalen-arm)

Patch is 98.70 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/166926.diff

29 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp b/llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
index a5048b9c9e61d..ccf28d86e9771 100644
--- a/llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
@@ -1123,24 +1123,83 @@ unsigned AArch64RegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,
}
}
-// FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register allocation
-// where a consecutive multi-vector tuple is constructed from the same indices
-// of multiple strided loads. This may still result in unnecessary copies
-// between the loads and the tuple. Here we try to return a hint to assign the
-// contiguous ZPRMulReg starting at the same register as the first operand of
-// the pseudo, which should be a subregister of the first strided load.
+// We add regalloc hints for different cases:
+// * Choosing a better destination operand for predicated SVE instructions
+// where the inactive lanes are undef, by choosing a register that is not
+// unique to the other operands of the instruction.
//
-// For example, if the first strided load has been assigned $z16_z20_z24_z28
-// and the operands of the pseudo are each accessing subregister zsub2, we
-// should look through through Order to find a contiguous register which
-// begins with $z24 (i.e. $z24_z25_z26_z27).
+// * Improve register allocation for SME multi-vector instructions where we can
+// benefit from the strided- and contiguous register multi-vector tuples.
//
+// Here FORM_TRANSPOSED_REG_TUPLE nodes are created to improve register
+// allocation where a consecutive multi-vector tuple is constructed from the
+// same indices of multiple strided loads. This may still result in
+// unnecessary copies between the loads and the tuple. Here we try to return a
+// hint to assign the contiguous ZPRMulReg starting at the same register as
+// the first operand of the pseudo, which should be a subregister of the first
+// strided load.
+//
+// For example, if the first strided load has been assigned $z16_z20_z24_z28
+// and the operands of the pseudo are each accessing subregister zsub2, we
+// should look through through Order to find a contiguous register which
+// begins with $z24 (i.e. $z24_z25_z26_z27).
bool AArch64RegisterInfo::getRegAllocationHints(
Register VirtReg, ArrayRef<MCPhysReg> Order,
SmallVectorImpl<MCPhysReg> &Hints, const MachineFunction &MF,
const VirtRegMap *VRM, const LiveRegMatrix *Matrix) const {
-
auto &ST = MF.getSubtarget<AArch64Subtarget>();
+ const AArch64InstrInfo *TII =
+ MF.getSubtarget<AArch64Subtarget>().getInstrInfo();
+ const MachineRegisterInfo &MRI = MF.getRegInfo();
+
+ // For predicated SVE instructions where the inactive lanes are undef,
+ // pick a destination register that is not unique to avoid introducing
+ // a movprfx to copy a unique register to the destination operand.
+ const TargetRegisterClass *RegRC = MRI.getRegClass(VirtReg);
+ if (ST.isSVEorStreamingSVEAvailable() &&
+ AArch64::ZPRRegClass.hasSubClassEq(RegRC)) {
+ for (const MachineOperand &DefOp : MRI.def_operands(VirtReg)) {
+ const MachineInstr &Def = *DefOp.getParent();
+ if (DefOp.isImplicit() ||
+ (TII->get(Def.getOpcode()).TSFlags & AArch64::FalseLanesMask) !=
+ AArch64::FalseLanesUndef)
+ continue;
+
+ for (MCPhysReg R : Order) {
+ auto AddHintIfSuitable = [&](MCPhysReg R, const MachineOperand &MO) {
+ if (!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R)
+ Hints.push_back(R);
+ };
+
+ unsigned Opcode = AArch64::getSVEPseudoMap(Def.getOpcode());
+ switch (TII->get(Opcode).TSFlags & AArch64::DestructiveInstTypeMask) {
+ default:
+ break;
+ case AArch64::DestructiveTernaryCommWithRev:
+ AddHintIfSuitable(R, Def.getOperand(2));
+ AddHintIfSuitable(R, Def.getOperand(3));
+ AddHintIfSuitable(R, Def.getOperand(4));
+ break;
+ case AArch64::DestructiveBinaryComm:
+ case AArch64::DestructiveBinaryCommWithRev:
+ AddHintIfSuitable(R, Def.getOperand(2));
+ AddHintIfSuitable(R, Def.getOperand(3));
+ break;
+ case AArch64::DestructiveBinary:
+ case AArch64::DestructiveBinaryImm:
+ case AArch64::DestructiveUnaryPassthru:
+ case AArch64::Destructive2xRegImmUnpred:
+ AddHintIfSuitable(R, Def.getOperand(2));
+ break;
+ }
+ }
+ }
+
+ if (Hints.size())
+ return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints,
+ MF, VRM);
+ }
+
if (!ST.hasSME() || !ST.isStreaming())
return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints, MF,
VRM);
@@ -1153,8 +1212,7 @@ bool AArch64RegisterInfo::getRegAllocationHints(
// FORM_TRANSPOSED_REG_TUPLE pseudo, we want to favour reducing copy
// instructions over reducing the number of clobbered callee-save registers,
// so we add the strided registers as a hint.
- const MachineRegisterInfo &MRI = MF.getRegInfo();
- unsigned RegID = MRI.getRegClass(VirtReg)->getID();
+ unsigned RegID = RegRC->getID();
if (RegID == AArch64::ZPR2StridedOrContiguousRegClassID ||
RegID == AArch64::ZPR4StridedOrContiguousRegClassID) {
diff --git a/llvm/test/CodeGen/AArch64/aarch64-combine-add-sub-mul.ll b/llvm/test/CodeGen/AArch64/aarch64-combine-add-sub-mul.ll
index e086ab92421fb..33ea74912251e 100644
--- a/llvm/test/CodeGen/AArch64/aarch64-combine-add-sub-mul.ll
+++ b/llvm/test/CodeGen/AArch64/aarch64-combine-add-sub-mul.ll
@@ -52,12 +52,11 @@ define <2 x i64> @test_mul_sub_2x64_2(<2 x i64> %a, <2 x i64> %b, <2 x i64> %c,
; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
-; CHECK-NEXT: // kill: def $q3 killed $q3 def $z3
; CHECK-NEXT: // kill: def $q2 killed $q2 def $z2
+; CHECK-NEXT: // kill: def $q3 killed $q3 def $z3
; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
-; CHECK-NEXT: movprfx z1, z2
-; CHECK-NEXT: mul z1.d, p0/m, z1.d, z3.d
-; CHECK-NEXT: sub v0.2d, v1.2d, v0.2d
+; CHECK-NEXT: mul z2.d, p0/m, z2.d, z3.d
+; CHECK-NEXT: sub v0.2d, v2.2d, v0.2d
; CHECK-NEXT: ret
%div = sdiv <2 x i64> %a, %b
%mul = mul <2 x i64> %c, %d
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-contract.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-contract.ll
index 533e831de0df8..258eaabee9376 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-contract.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-contract.ll
@@ -14,13 +14,12 @@ define <vscale x 4 x double> @mull_add(<vscale x 4 x double> %a, <vscale x 4 x d
; CHECK-NEXT: ptrue p0.d
; CHECK-NEXT: fmul z7.d, z0.d, z1.d
; CHECK-NEXT: fmul z1.d, z6.d, z1.d
-; CHECK-NEXT: movprfx z3, z7
-; CHECK-NEXT: fmla z3.d, p0/m, z6.d, z2.d
+; CHECK-NEXT: fmad z6.d, p0/m, z2.d, z7.d
; CHECK-NEXT: fnmsb z0.d, p0/m, z2.d, z1.d
; CHECK-NEXT: uzp2 z1.d, z4.d, z5.d
; CHECK-NEXT: uzp1 z2.d, z4.d, z5.d
; CHECK-NEXT: fadd z2.d, z2.d, z0.d
-; CHECK-NEXT: fadd z1.d, z3.d, z1.d
+; CHECK-NEXT: fadd z1.d, z6.d, z1.d
; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
; CHECK-NEXT: ret
@@ -225,17 +224,14 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
; CHECK-NEXT: fmul z1.d, z25.d, z1.d
; CHECK-NEXT: fmul z3.d, z4.d, z24.d
; CHECK-NEXT: fmul z24.d, z5.d, z24.d
-; CHECK-NEXT: movprfx z7, z26
-; CHECK-NEXT: fmla z7.d, p0/m, z25.d, z2.d
+; CHECK-NEXT: fmad z25.d, p0/m, z2.d, z26.d
; CHECK-NEXT: fnmsb z0.d, p0/m, z2.d, z1.d
-; CHECK-NEXT: movprfx z1, z3
-; CHECK-NEXT: fmla z1.d, p0/m, z6.d, z5.d
-; CHECK-NEXT: movprfx z2, z24
-; CHECK-NEXT: fnmls z2.d, p0/m, z4.d, z6.d
-; CHECK-NEXT: fadd z2.d, z0.d, z2.d
-; CHECK-NEXT: fadd z1.d, z7.d, z1.d
-; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
-; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
+; CHECK-NEXT: fmla z3.d, p0/m, z6.d, z5.d
+; CHECK-NEXT: fnmsb z4.d, p0/m, z6.d, z24.d
+; CHECK-NEXT: fadd z1.d, z0.d, z4.d
+; CHECK-NEXT: fadd z2.d, z25.d, z3.d
+; CHECK-NEXT: zip1 z0.d, z1.d, z2.d
+; CHECK-NEXT: zip2 z1.d, z1.d, z2.d
; CHECK-NEXT: ret
entry:
%strided.vec = tail call { <vscale x 2 x double>, <vscale x 2 x double> } @llvm.vector.deinterleave2.nxv4f64(<vscale x 4 x double> %a)
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
index 1eed9722f57be..b68c0094f84de 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-add-mull-scalable-fast.ll
@@ -200,12 +200,10 @@ define <vscale x 4 x double> @mul_add_rot_mull(<vscale x 4 x double> %a, <vscale
; CHECK-NEXT: fmul z3.d, z2.d, z25.d
; CHECK-NEXT: fmul z25.d, z24.d, z25.d
; CHECK-NEXT: fmla z3.d, p0/m, z24.d, z0.d
-; CHECK-NEXT: movprfx z24, z25
-; CHECK-NEXT: fmla z24.d, p0/m, z26.d, z1.d
-; CHECK-NEXT: movprfx z6, z24
-; CHECK-NEXT: fmla z6.d, p0/m, z5.d, z4.d
+; CHECK-NEXT: fmla z25.d, p0/m, z26.d, z1.d
+; CHECK-NEXT: fmla z25.d, p0/m, z5.d, z4.d
; CHECK-NEXT: fmla z3.d, p0/m, z26.d, z4.d
-; CHECK-NEXT: fnmsb z2.d, p0/m, z0.d, z6.d
+; CHECK-NEXT: fnmsb z2.d, p0/m, z0.d, z25.d
; CHECK-NEXT: fmsb z1.d, p0/m, z5.d, z3.d
; CHECK-NEXT: zip1 z0.d, z2.d, z1.d
; CHECK-NEXT: zip2 z1.d, z2.d, z1.d
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add-scalable.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add-scalable.ll
index c2fc959d8e101..583391cd22ef7 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add-scalable.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add-scalable.ll
@@ -17,11 +17,10 @@ define <vscale x 4 x half> @complex_add_v4f16(<vscale x 4 x half> %a, <vscale x
; CHECK-NEXT: uunpklo z3.d, z3.s
; CHECK-NEXT: uunpklo z1.d, z1.s
; CHECK-NEXT: fsubr z0.h, p0/m, z0.h, z1.h
-; CHECK-NEXT: movprfx z1, z3
-; CHECK-NEXT: fadd z1.h, p0/m, z1.h, z2.h
-; CHECK-NEXT: zip2 z2.d, z0.d, z1.d
-; CHECK-NEXT: zip1 z0.d, z0.d, z1.d
-; CHECK-NEXT: uzp1 z0.s, z0.s, z2.s
+; CHECK-NEXT: fadd z2.h, p0/m, z2.h, z3.h
+; CHECK-NEXT: zip2 z1.d, z0.d, z2.d
+; CHECK-NEXT: zip1 z0.d, z0.d, z2.d
+; CHECK-NEXT: uzp1 z0.s, z0.s, z1.s
; CHECK-NEXT: ret
entry:
%a.deinterleaved = tail call { <vscale x 2 x half>, <vscale x 2 x half> } @llvm.vector.deinterleave2.nxv4f16(<vscale x 4 x half> %a)
diff --git a/llvm/test/CodeGen/AArch64/complex-deinterleaving-i16-mul-scalable.ll b/llvm/test/CodeGen/AArch64/complex-deinterleaving-i16-mul-scalable.ll
index 061fd07489284..00b0095e4309c 100644
--- a/llvm/test/CodeGen/AArch64/complex-deinterleaving-i16-mul-scalable.ll
+++ b/llvm/test/CodeGen/AArch64/complex-deinterleaving-i16-mul-scalable.ll
@@ -18,11 +18,10 @@ define <vscale x 4 x i16> @complex_mul_v4i16(<vscale x 4 x i16> %a, <vscale x 4
; CHECK-NEXT: uzp2 z1.d, z1.d, z3.d
; CHECK-NEXT: mul z5.d, z2.d, z0.d
; CHECK-NEXT: mul z2.d, z2.d, z4.d
-; CHECK-NEXT: movprfx z3, z5
-; CHECK-NEXT: mla z3.d, p0/m, z1.d, z4.d
+; CHECK-NEXT: mad z4.d, p0/m, z1.d, z5.d
; CHECK-NEXT: msb z0.d, p0/m, z1.d, z2.d
-; CHECK-NEXT: zip2 z1.d, z0.d, z3.d
-; CHECK-NEXT: zip1 z0.d, z0.d, z3.d
+; CHECK-NEXT: zip2 z1.d, z0.d, z4.d
+; CHECK-NEXT: zip1 z0.d, z0.d, z4.d
; CHECK-NEXT: uzp1 z0.s, z0.s, z1.s
; CHECK-NEXT: ret
entry:
diff --git a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
index 47fae5a01c931..f0abbaac2e68c 100644
--- a/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
+++ b/llvm/test/CodeGen/AArch64/llvm-ir-to-intrinsic.ll
@@ -1148,11 +1148,10 @@ define <vscale x 4 x i64> @fshl_rot_illegal_i64(<vscale x 4 x i64> %a, <vscale x
; CHECK-NEXT: and z3.d, z3.d, #0x3f
; CHECK-NEXT: lslr z4.d, p0/m, z4.d, z0.d
; CHECK-NEXT: lsr z0.d, p0/m, z0.d, z2.d
-; CHECK-NEXT: movprfx z2, z1
-; CHECK-NEXT: lsl z2.d, p0/m, z2.d, z5.d
+; CHECK-NEXT: lslr z5.d, p0/m, z5.d, z1.d
; CHECK-NEXT: lsr z1.d, p0/m, z1.d, z3.d
; CHECK-NEXT: orr z0.d, z4.d, z0.d
-; CHECK-NEXT: orr z1.d, z2.d, z1.d
+; CHECK-NEXT: orr z1.d, z5.d, z1.d
; CHECK-NEXT: ret
%fshl = call <vscale x 4 x i64> @llvm.fshl.nxv4i64(<vscale x 4 x i64> %a, <vscale x 4 x i64> %a, <vscale x 4 x i64> %b)
ret <vscale x 4 x i64> %fshl
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
index 6fbae7edfec0a..2dda03e5c6dab 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-arith.ll
@@ -55,10 +55,9 @@ define void @fadd_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.h, p0/m, z0.h, z1.h
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fadd z1.h, p0/m, z1.h, z3.h
+; VBITS_GE_256-NEXT: fadd z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
+; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v32f16:
@@ -154,10 +153,9 @@ define void @fadd_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.s, p0/m, z0.s, z1.s
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fadd z1.s, p0/m, z1.s, z3.s
+; VBITS_GE_256-NEXT: fadd z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
+; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v16f32:
@@ -253,10 +251,9 @@ define void @fadd_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fadd z0.d, p0/m, z0.d, z1.d
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fadd z1.d, p0/m, z1.d, z3.d
+; VBITS_GE_256-NEXT: fadd z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
+; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fadd_v8f64:
@@ -660,10 +657,9 @@ define void @fma_v32f16(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1h { z4.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1h { z5.h }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.h, p0/m, z1.h, z2.h
-; VBITS_GE_256-NEXT: movprfx z1, z5
-; VBITS_GE_256-NEXT: fmla z1.h, p0/m, z3.h, z4.h
+; VBITS_GE_256-NEXT: fmad z3.h, p0/m, z4.h, z5.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
+; VBITS_GE_256-NEXT: st1h { z3.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v32f16:
@@ -771,10 +767,9 @@ define void @fma_v16f32(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1w { z4.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1w { z5.s }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.s, p0/m, z1.s, z2.s
-; VBITS_GE_256-NEXT: movprfx z1, z5
-; VBITS_GE_256-NEXT: fmla z1.s, p0/m, z3.s, z4.s
+; VBITS_GE_256-NEXT: fmad z3.s, p0/m, z4.s, z5.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
+; VBITS_GE_256-NEXT: st1w { z3.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v16f32:
@@ -881,10 +876,9 @@ define void @fma_v8f64(ptr %a, ptr %b, ptr %c) #0 {
; VBITS_GE_256-NEXT: ld1d { z4.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: ld1d { z5.d }, p0/z, [x2]
; VBITS_GE_256-NEXT: fmad z0.d, p0/m, z1.d, z2.d
-; VBITS_GE_256-NEXT: movprfx z1, z5
-; VBITS_GE_256-NEXT: fmla z1.d, p0/m, z3.d, z4.d
+; VBITS_GE_256-NEXT: fmad z3.d, p0/m, z4.d, z5.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
+; VBITS_GE_256-NEXT: st1d { z3.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fma_v8f64:
@@ -990,10 +984,9 @@ define void @fmul_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.h, p0/m, z0.h, z1.h
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fmul z1.h, p0/m, z1.h, z3.h
+; VBITS_GE_256-NEXT: fmul z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
+; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v32f16:
@@ -1089,10 +1082,9 @@ define void @fmul_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.s, p0/m, z0.s, z1.s
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fmul z1.s, p0/m, z1.s, z3.s
+; VBITS_GE_256-NEXT: fmul z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
+; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v16f32:
@@ -1188,10 +1180,9 @@ define void @fmul_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fmul z0.d, p0/m, z0.d, z1.d
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fmul z1.d, p0/m, z1.d, z3.d
+; VBITS_GE_256-NEXT: fmul z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
+; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fmul_v8f64:
@@ -1827,10 +1818,9 @@ define void @fsub_v32f16(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.h, p0/m, z0.h, z1.h
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fsub z1.h, p0/m, z1.h, z3.h
+; VBITS_GE_256-NEXT: fsub z2.h, p0/m, z2.h, z3.h
; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
+; VBITS_GE_256-NEXT: st1h { z2.h }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v32f16:
@@ -1926,10 +1916,9 @@ define void @fsub_v16f32(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.s, p0/m, z0.s, z1.s
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fsub z1.s, p0/m, z1.s, z3.s
+; VBITS_GE_256-NEXT: fsub z2.s, p0/m, z2.s, z3.s
; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
+; VBITS_GE_256-NEXT: st1w { z2.s }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v16f32:
@@ -2025,10 +2014,9 @@ define void @fsub_v8f64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x0]
; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
; VBITS_GE_256-NEXT: fsub z0.d, p0/m, z0.d, z1.d
-; VBITS_GE_256-NEXT: movprfx z1, z2
-; VBITS_GE_256-NEXT: fsub z1.d, p0/m, z1.d, z3.d
+; VBITS_GE_256-NEXT: fsub z2.d, p0/m, z2.d, z3.d
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
-; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
+; VBITS_GE_256-NEXT: st1d { z2.d }, p0, [x0]
; VBITS_GE_256-NEXT: ret
;
; VBITS_GE_512-LABEL: fsub_v8f64:
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-fma.ll
index e1ec5ee5f6137..633b429db...
[truncated]
case AArch64::DestructiveTernaryCommWithRev:
  AddHintIfSuitable(R, Def.getOperand(2));
  AddHintIfSuitable(R, Def.getOperand(3));
  AddHintIfSuitable(R, Def.getOperand(4));
Do you remember if there is any priority order for hints? E.g. will the hint added by AddHintIfSuitable(R, Def.getOperand(2)) be considered first for assigning a phys reg?
The code here adds hints using the priority order from ArrayRef<MCPhysReg> Order, so the order in which it calls AddHintIfSuitable (for each Def.getOperand(K)) does not matter.
In general, the priority order of hints does matter, as the register allocator will try the hints in the order specified.
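For illustration, a tiny standalone sketch (made-up register numbers, not LLVM code) of why only the walk over Order determines the priority:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  // Allocation order z0..z3; the two source operands were assigned z2 and z1.
  std::vector<uint16_t> Order = {0, 1, 2, 3};
  std::vector<uint16_t> OperandPhys = {2, 1}; // deliberately "reversed"
  std::vector<uint16_t> Hints;

  // Hints are pushed while walking Order, so they come out in Order's
  // sequence no matter how the operands are enumerated in the inner loop.
  for (uint16_t R : Order)
    for (uint16_t P : OperandPhys)
      if (P == R)
        Hints.push_back(R);

  for (uint16_t H : Hints)
    std::printf("z%u ", H); // prints: z1 z2
  std::printf("\n");
  return 0;
}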
// For predicated SVE instructions where the inactive lanes are undef,
// pick a destination register that is not unique to avoid introducing
// a movprfx to copy a unique register to the destination operand.
Super-nit: "a destination register that is not unique" I guess is true for commutative instructions, where we do not really care which reg is chosen. For non-commutative ones, I think we will specifically hint that the destination register should be the same as the "destructed input".
I don't know how to rephrase the comment better though :D
unsigned Opcode = AArch64::getSVEPseudoMap(Def.getOpcode());
switch (TII->get(Opcode).TSFlags & AArch64::DestructiveInstTypeMask) {
You could perhaps move this outside the Order loop to avoid repeating some computations (for example, having a SmallVector for the indices or registers of Def operands, and iterating over those here instead).
I could do something like this:
// Try to add register as hint if the register is not unique.
auto TryAddRegHints = [&](ArrayRef<unsigned> OpIndices) {
  for (MCPhysReg R : Order) {
    for (unsigned OpIdx : OpIndices) {
      const MachineOperand &MO = Def.getOperand(OpIdx);
      // R is a suitable register hint if there exists an operand for the
      // instruction that is not yet allocated a register or if R matches
      // one of the other source operands.
      if (!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R)
        Hints.push_back(R);
    }
  }
};

unsigned Opcode = AArch64::getSVEPseudoMap(Def.getOpcode());
switch (TII->get(Opcode).TSFlags & AArch64::DestructiveInstTypeMask) {
default:
  break;
case AArch64::DestructiveTernaryCommWithRev:
  TryAddRegHints({2, 3, 4});
  break;
case AArch64::DestructiveBinaryComm:
case AArch64::DestructiveBinaryCommWithRev:
  TryAddRegHints({2, 3});
  break;
case AArch64::DestructiveBinary:
case AArch64::DestructiveBinaryImm:
  TryAddRegHints({2});
  break;
}
But personally I find the current code slightly easier to read, as there's no need to pass a somewhat opaque {2, 3, 4} list into the lambda, which would also mean I'd need to find a better name than TryAddRegHints.
I was thinking something like this:
SmallVector<unsigned, 4> OpIndices;
unsigned Opcode = AArch64::getSVEPseudoMap(Def.getOpcode());
switch (TII->get(Opcode).TSFlags & AArch64::DestructiveInstTypeMask) {
default:
  break;
case AArch64::DestructiveTernaryCommWithRev:
  OpIndices = {2, 3, 4};
  break;
case AArch64::DestructiveBinaryComm:
case AArch64::DestructiveBinaryCommWithRev:
  OpIndices = {2, 3};
  break;
case AArch64::DestructiveBinary:
case AArch64::DestructiveBinaryImm:
  OpIndices = {2};
  break;
}

for (MCPhysReg R : Order) {
  for (unsigned OpIdx : OpIndices) {
    const MachineOperand &MO = Def.getOperand(OpIdx);
    // R is a suitable register hint if there exists an operand for the
    // instruction that is not yet allocated a register or if R matches
    // one of the other source operands.
    if (!VRM->hasPhys(MO.getReg()) || VRM->getPhys(MO.getReg()) == R)
      Hints.push_back(R);
  }
}
If you think this is more confusing than the current code or is not worth it for any other reason, please feel free to disregard my suggestion.
if (Hints.size())
  return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints,
                                                   MF, VRM);
Is there a reason to prefer adding the hints above before target-independent ones?
This comment on getRegAllocationHints probably explains it best: https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/TargetRegisterInfo.h#L986-L999
I'm really sorry if I'm missing something very obvious, but is the expectation that the hints added above should take precedence over the copy hints that the target-independent implementation adds?
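For illustration, a sketch of the overall pattern (the class name MyRegisterInfo and the isGoodTargetSpecificChoice predicate are placeholders, not LLVM API): hints the target pushes before delegating to the base implementation sit earlier in Hints, and the allocator tries hints in that order, so they are considered before the generic (typically copy-derived) hints the base implementation appends afterwards.

// Sketch only: target-specific hints go in first, then the base class adds
// whatever hints are already recorded in MachineRegisterInfo (e.g. from
// copies) after them.
bool MyRegisterInfo::getRegAllocationHints(
    Register VirtReg, ArrayRef<MCPhysReg> Order,
    SmallVectorImpl<MCPhysReg> &Hints, const MachineFunction &MF,
    const VirtRegMap *VRM, const LiveRegMatrix *Matrix) const {
  for (MCPhysReg R : Order)
    if (isGoodTargetSpecificChoice(VirtReg, R)) // placeholder predicate
      Hints.push_back(R);                       // highest priority first

  // Delegating to the base class appends the generic hints after the ones
  // added above, so the target hints keep their precedence.
  return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints, MF,
                                                   VRM, Matrix);
}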