[LegalizeTypes] Expand 128-bit UDIV/UREM by constant via Chunk Addition #146238
Conversation
This patch improves the lowering of 128-bit unsigned division and remainder by constants (UDIV/UREM) by avoiding the fallback to a libcall (__udivti3/__umodti3) for specific divisors. When a divisor D satisfies (1 << ChunkWidth) % D == 1, the 128-bit value is split into fixed-width chunks (e.g., 30-bit chunks) and the chunks are summed before applying a smaller UDIV/UREM. The transformation is based on the "remainder by summing digits" trick described in Hacker's Delight. This fixes PR137514 for some constants.
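As a standalone illustration of the trick (not the DAG code in this patch), the following minimal sketch computes a 128-bit remainder by summing 30-bit digits. It assumes a compiler that provides the unsigned __int128 extension, and the helper name urem_by_chunks is invented for this example.

#include <cassert>
#include <cstdint>
#include <cstdio>

// If (1 << ChunkWidth) % D == 1, every base-2^ChunkWidth digit of X is weighted
// by a power of two that is congruent to 1 mod D, so X % D == (digit sum) % D.
static uint64_t urem_by_chunks(unsigned __int128 X, uint64_t D, unsigned ChunkWidth) {
  assert(((unsigned __int128)1 << ChunkWidth) % D == 1 && "trick precondition");
  const unsigned __int128 Mask = ((unsigned __int128)1 << ChunkWidth) - 1;
  unsigned __int128 Sum = 0;
  for (unsigned Shift = 0; Shift < 128; Shift += ChunkWidth)
    Sum += (X >> Shift) & Mask; // sum the base-2^ChunkWidth digits
  // The digit sum is only a few bits wider than one chunk, so a narrow
  // remainder finishes the job.
  return (uint64_t)(Sum % D);
}

int main() {
  unsigned __int128 X = ((unsigned __int128)0x0123456789abcdefULL << 64) |
                        0xfedcba9876543210ULL;
  // 2^30 % 7 == 1 and 2^30 % 9 == 1, so 30-bit chunks work for both divisors.
  printf("%llu == %llu\n", (unsigned long long)urem_by_chunks(X, 7, 30),
         (unsigned long long)(X % 7));
  printf("%llu == %llu\n", (unsigned long long)urem_by_chunks(X, 9, 30),
         (unsigned long long)(X % 9));
}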
@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-backend-x86

Author: Shivam Gupta (xgupta)

Patch is 36.18 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/146238.diff

8 Files Affected:
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index a0b5f67c2e6c7..e8dc9280e8e2b 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -7981,8 +7981,6 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
// If (1 << HBitWidth) % divisor == 1, we can add the two halves together and
// then add in the carry.
- // TODO: If we can't split it in half, we might be able to split into 3 or
- // more pieces using a smaller bit width.
if (HalfMaxPlus1.urem(Divisor).isOne()) {
assert(!LL == !LH && "Expected both input halves or no input halves!");
if (!LL)
@@ -8030,6 +8028,80 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
DAG.getConstant(0, dl, HiLoVT));
Sum = DAG.getNode(ISD::ADD, dl, HiLoVT, Sum, Carry);
}
+
+ } else {
+ // If we cannot split into two halves, look for a smaller chunk width
+ // where (1 << ChunkWidth) mod Divisor == 1. This ensures that the sum of
+ // all such chunks modulo Divisor equals the original value modulo Divisor.
+ const APInt &Divisor = CN->getAPIntValue();
+ unsigned BitWidth = VT.getScalarSizeInBits();
+ unsigned BestChunkWidth = 0;
+
+ // We restrict to small chunk sizes (e.g., ≤ 32 bits) to ensure that all
+ // operations remain legal on most targets.
+ unsigned MaxChunk = 32;
+ for (int i = MaxChunk; i >= 1; --i) {
+ APInt ChunkMaxPlus1 = APInt::getOneBitSet(BitWidth, i);
+ if (ChunkMaxPlus1.urem(Divisor).isOne()) {
+ BestChunkWidth = i;
+ break;
+ }
+ }
+
+ // If we found a good chunk width, slice the number and sum the pieces.
+ if (BestChunkWidth > 0) {
+ EVT ChunkVT = EVT::getIntegerVT(*DAG.getContext(), BestChunkWidth);
+
+ if (!LL)
+ std::tie(LL, LH) =
+ DAG.SplitScalar(N->getOperand(0), dl, HiLoVT, HiLoVT);
+ SDValue In = DAG.getNode(ISD::BUILD_PAIR, dl, VT, LL, LH);
+
+ SmallVector<SDValue, 8> Parts;
+ // Split into fixed-size chunks
+ for (unsigned i = 0; i < BitWidth; i += BestChunkWidth) {
+ SDValue Shift = DAG.getShiftAmountConstant(i, VT, dl);
+ SDValue Chunk = DAG.getNode(ISD::SRL, dl, VT, In, Shift);
+ Chunk = DAG.getNode(ISD::TRUNCATE, dl, ChunkVT, Chunk);
+ Parts.push_back(Chunk);
+ }
+ if (Parts.empty())
+ return false;
+ Sum = Parts[0];
+
+ // Use uaddo_carry if we can, otherwise use a compare to detect overflow.
+ // This is the same carry logic as the half-split case above.
+ SDValue Carry = DAG.getConstant(0, dl, ChunkVT);
+ EVT SetCCType =
+ getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), ChunkVT);
+ for (unsigned i = 1; i < Parts.size(); ++i) {
+ if (isOperationLegalOrCustom(ISD::UADDO_CARRY, ChunkVT)) {
+ SDVTList VTList = DAG.getVTList(ChunkVT, SetCCType);
+ SDValue UAdd = DAG.getNode(ISD::UADDO, dl, VTList, Sum, Parts[i]);
+ Sum = DAG.getNode(ISD::UADDO_CARRY, dl, VTList, UAdd, Carry,
+ UAdd.getValue(1));
+ } else {
+ SDValue Add = DAG.getNode(ISD::ADD, dl, ChunkVT, Sum, Parts[i]);
+ SDValue NewCarry = DAG.getSetCC(dl, SetCCType, Add, Sum, ISD::SETULT);
+
+ if (getBooleanContents(ChunkVT) ==
+ TargetLoweringBase::ZeroOrOneBooleanContent)
+ NewCarry = DAG.getZExtOrTrunc(NewCarry, dl, ChunkVT);
+ else
+ NewCarry = DAG.getSelect(dl, ChunkVT, NewCarry,
+ DAG.getConstant(1, dl, ChunkVT),
+ DAG.getConstant(0, dl, ChunkVT));
+
+ Sum = DAG.getNode(ISD::ADD, dl, ChunkVT, Add, Carry);
+ Carry = NewCarry;
+ }
+ }
+
+ Sum = DAG.getNode(ISD::ZERO_EXTEND, dl, HiLoVT, Sum);
+ } else {
+ return false;
+ }
}
// If we didn't find a sum, we can't do the expansion.
diff --git a/llvm/test/CodeGen/RISCV/div-by-constant.ll b/llvm/test/CodeGen/RISCV/div-by-constant.ll
index ea8b04d727acf..f2e1979fc4057 100644
--- a/llvm/test/CodeGen/RISCV/div-by-constant.ll
+++ b/llvm/test/CodeGen/RISCV/div-by-constant.ll
@@ -115,16 +115,76 @@ define i64 @udiv64_constant_no_add(i64 %a) nounwind {
}
define i64 @udiv64_constant_add(i64 %a) nounwind {
-; RV32-LABEL: udiv64_constant_add:
-; RV32: # %bb.0:
-; RV32-NEXT: addi sp, sp, -16
-; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32-NEXT: li a2, 7
-; RV32-NEXT: li a3, 0
-; RV32-NEXT: call __udivdi3
-; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32-NEXT: addi sp, sp, 16
-; RV32-NEXT: ret
+; RV32IM-LABEL: udiv64_constant_add:
+; RV32IM: # %bb.0:
+; RV32IM-NEXT: lui a2, 262144
+; RV32IM-NEXT: slli a3, a1, 2
+; RV32IM-NEXT: srli a4, a0, 30
+; RV32IM-NEXT: srli a5, a1, 28
+; RV32IM-NEXT: lui a6, 149797
+; RV32IM-NEXT: addi a2, a2, -1
+; RV32IM-NEXT: or a3, a4, a3
+; RV32IM-NEXT: and a4, a0, a2
+; RV32IM-NEXT: add a3, a0, a3
+; RV32IM-NEXT: add a5, a3, a5
+; RV32IM-NEXT: and a3, a3, a2
+; RV32IM-NEXT: sltu a3, a3, a4
+; RV32IM-NEXT: lui a4, 449390
+; RV32IM-NEXT: add a3, a5, a3
+; RV32IM-NEXT: lui a5, 748983
+; RV32IM-NEXT: addi a6, a6, -1755
+; RV32IM-NEXT: addi a4, a4, -1171
+; RV32IM-NEXT: addi a5, a5, -585
+; RV32IM-NEXT: and a2, a3, a2
+; RV32IM-NEXT: mulhu a3, a2, a6
+; RV32IM-NEXT: slli a6, a3, 3
+; RV32IM-NEXT: add a2, a2, a3
+; RV32IM-NEXT: sub a2, a2, a6
+; RV32IM-NEXT: sub a3, a0, a2
+; RV32IM-NEXT: sltu a0, a0, a2
+; RV32IM-NEXT: mul a2, a3, a4
+; RV32IM-NEXT: mulhu a4, a3, a5
+; RV32IM-NEXT: sub a1, a1, a0
+; RV32IM-NEXT: add a2, a4, a2
+; RV32IM-NEXT: mul a1, a1, a5
+; RV32IM-NEXT: add a1, a2, a1
+; RV32IM-NEXT: mul a0, a3, a5
+; RV32IM-NEXT: ret
+;
+; RV32IMZB-LABEL: udiv64_constant_add:
+; RV32IMZB: # %bb.0:
+; RV32IMZB-NEXT: srli a2, a0, 30
+; RV32IMZB-NEXT: srli a3, a1, 28
+; RV32IMZB-NEXT: lui a4, 786432
+; RV32IMZB-NEXT: slli a5, a0, 2
+; RV32IMZB-NEXT: lui a6, 149797
+; RV32IMZB-NEXT: sh2add a2, a1, a2
+; RV32IMZB-NEXT: srli a5, a5, 2
+; RV32IMZB-NEXT: add a2, a0, a2
+; RV32IMZB-NEXT: add a3, a2, a3
+; RV32IMZB-NEXT: andn a2, a2, a4
+; RV32IMZB-NEXT: sltu a2, a2, a5
+; RV32IMZB-NEXT: lui a5, 449390
+; RV32IMZB-NEXT: add a2, a3, a2
+; RV32IMZB-NEXT: lui a3, 748983
+; RV32IMZB-NEXT: addi a6, a6, -1755
+; RV32IMZB-NEXT: addi a5, a5, -1171
+; RV32IMZB-NEXT: addi a3, a3, -585
+; RV32IMZB-NEXT: andn a2, a2, a4
+; RV32IMZB-NEXT: mulhu a4, a2, a6
+; RV32IMZB-NEXT: slli a6, a4, 3
+; RV32IMZB-NEXT: add a2, a2, a4
+; RV32IMZB-NEXT: sub a2, a2, a6
+; RV32IMZB-NEXT: sub a4, a0, a2
+; RV32IMZB-NEXT: sltu a0, a0, a2
+; RV32IMZB-NEXT: mul a2, a4, a5
+; RV32IMZB-NEXT: mulhu a5, a4, a3
+; RV32IMZB-NEXT: sub a1, a1, a0
+; RV32IMZB-NEXT: add a2, a5, a2
+; RV32IMZB-NEXT: mul a1, a1, a3
+; RV32IMZB-NEXT: add a1, a2, a1
+; RV32IMZB-NEXT: mul a0, a4, a3
+; RV32IMZB-NEXT: ret
;
; RV64-LABEL: udiv64_constant_add:
; RV64: # %bb.0:
diff --git a/llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll b/llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll
index eb70d7f43c0ef..8250fc3a176e2 100644
--- a/llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll
+++ b/llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll
@@ -117,24 +117,89 @@ define iXLen2 @test_udiv_5(iXLen2 %x) nounwind {
define iXLen2 @test_udiv_7(iXLen2 %x) nounwind {
; RV32-LABEL: test_udiv_7:
; RV32: # %bb.0:
-; RV32-NEXT: addi sp, sp, -16
-; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32-NEXT: li a2, 7
-; RV32-NEXT: li a3, 0
-; RV32-NEXT: call __udivdi3
-; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32-NEXT: addi sp, sp, 16
+; RV32-NEXT: lui a2, 262144
+; RV32-NEXT: slli a3, a1, 2
+; RV32-NEXT: srli a4, a0, 30
+; RV32-NEXT: srli a5, a1, 28
+; RV32-NEXT: lui a6, 149797
+; RV32-NEXT: addi a2, a2, -1
+; RV32-NEXT: or a3, a4, a3
+; RV32-NEXT: and a4, a0, a2
+; RV32-NEXT: add a3, a0, a3
+; RV32-NEXT: add a5, a3, a5
+; RV32-NEXT: and a3, a3, a2
+; RV32-NEXT: sltu a3, a3, a4
+; RV32-NEXT: lui a4, 449390
+; RV32-NEXT: add a3, a5, a3
+; RV32-NEXT: lui a5, 748983
+; RV32-NEXT: addi a6, a6, -1755
+; RV32-NEXT: addi a4, a4, -1171
+; RV32-NEXT: addi a5, a5, -585
+; RV32-NEXT: and a2, a3, a2
+; RV32-NEXT: mulhu a3, a2, a6
+; RV32-NEXT: slli a6, a3, 3
+; RV32-NEXT: add a2, a2, a3
+; RV32-NEXT: sub a2, a2, a6
+; RV32-NEXT: sub a3, a0, a2
+; RV32-NEXT: sltu a0, a0, a2
+; RV32-NEXT: mul a2, a3, a4
+; RV32-NEXT: mulhu a4, a3, a5
+; RV32-NEXT: sub a1, a1, a0
+; RV32-NEXT: add a2, a4, a2
+; RV32-NEXT: mul a1, a1, a5
+; RV32-NEXT: add a1, a2, a1
+; RV32-NEXT: mul a0, a3, a5
; RV32-NEXT: ret
;
; RV64-LABEL: test_udiv_7:
; RV64: # %bb.0:
-; RV64-NEXT: addi sp, sp, -16
-; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
-; RV64-NEXT: li a2, 7
-; RV64-NEXT: li a3, 0
-; RV64-NEXT: call __udivti3
-; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
-; RV64-NEXT: addi sp, sp, 16
+; RV64-NEXT: slli a2, a1, 4
+; RV64-NEXT: srli a3, a0, 60
+; RV64-NEXT: slli a4, a1, 34
+; RV64-NEXT: srli a5, a0, 30
+; RV64-NEXT: lui a6, 262144
+; RV64-NEXT: srli a7, a1, 26
+; RV64-NEXT: or a2, a3, a2
+; RV64-NEXT: lui a3, 748983
+; RV64-NEXT: or a4, a5, a4
+; RV64-NEXT: addi a6, a6, -1
+; RV64-NEXT: addi a3, a3, -585
+; RV64-NEXT: add a4, a0, a4
+; RV64-NEXT: slli a5, a3, 33
+; RV64-NEXT: add a3, a3, a5
+; RV64-NEXT: and a5, a0, a6
+; RV64-NEXT: add a2, a4, a2
+; RV64-NEXT: and a4, a4, a6
+; RV64-NEXT: sltu a5, a4, a5
+; RV64-NEXT: add a5, a2, a5
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: sltu a2, a2, a4
+; RV64-NEXT: srli a4, a1, 56
+; RV64-NEXT: add a2, a2, a4
+; RV64-NEXT: lui a4, %hi(.LCPI2_0)
+; RV64-NEXT: add a7, a5, a7
+; RV64-NEXT: and a5, a5, a6
+; RV64-NEXT: add a2, a7, a2
+; RV64-NEXT: and a7, a7, a6
+; RV64-NEXT: sltu a5, a7, a5
+; RV64-NEXT: lui a7, %hi(.LCPI2_1)
+; RV64-NEXT: ld a4, %lo(.LCPI2_0)(a4)
+; RV64-NEXT: ld a7, %lo(.LCPI2_1)(a7)
+; RV64-NEXT: add a2, a2, a5
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: mulhu a4, a2, a4
+; RV64-NEXT: slli a5, a4, 3
+; RV64-NEXT: add a2, a2, a4
+; RV64-NEXT: sub a2, a2, a5
+; RV64-NEXT: sub a4, a0, a2
+; RV64-NEXT: sltu a0, a0, a2
+; RV64-NEXT: mul a2, a4, a7
+; RV64-NEXT: mulhu a5, a4, a3
+; RV64-NEXT: sub a1, a1, a0
+; RV64-NEXT: add a2, a5, a2
+; RV64-NEXT: mul a1, a1, a3
+; RV64-NEXT: add a1, a2, a1
+; RV64-NEXT: mul a0, a4, a3
; RV64-NEXT: ret
%a = udiv iXLen2 %x, 7
ret iXLen2 %a
@@ -143,24 +208,86 @@ define iXLen2 @test_udiv_7(iXLen2 %x) nounwind {
define iXLen2 @test_udiv_9(iXLen2 %x) nounwind {
; RV32-LABEL: test_udiv_9:
; RV32: # %bb.0:
-; RV32-NEXT: addi sp, sp, -16
-; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32-NEXT: li a2, 9
-; RV32-NEXT: li a3, 0
-; RV32-NEXT: call __udivdi3
-; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32-NEXT: addi sp, sp, 16
+; RV32-NEXT: lui a2, 262144
+; RV32-NEXT: slli a3, a1, 2
+; RV32-NEXT: srli a4, a0, 30
+; RV32-NEXT: srli a5, a1, 28
+; RV32-NEXT: lui a6, 233017
+; RV32-NEXT: addi a2, a2, -1
+; RV32-NEXT: or a3, a4, a3
+; RV32-NEXT: and a4, a0, a2
+; RV32-NEXT: add a3, a0, a3
+; RV32-NEXT: add a5, a3, a5
+; RV32-NEXT: and a3, a3, a2
+; RV32-NEXT: sltu a3, a3, a4
+; RV32-NEXT: lui a4, 582542
+; RV32-NEXT: addi a6, a6, -455
+; RV32-NEXT: addi a4, a4, 910
+; RV32-NEXT: add a3, a5, a3
+; RV32-NEXT: and a2, a3, a2
+; RV32-NEXT: mulhu a3, a2, a6
+; RV32-NEXT: srli a3, a3, 1
+; RV32-NEXT: slli a5, a3, 3
+; RV32-NEXT: sub a2, a2, a3
+; RV32-NEXT: sub a2, a2, a5
+; RV32-NEXT: sub a3, a0, a2
+; RV32-NEXT: sltu a0, a0, a2
+; RV32-NEXT: mul a2, a3, a4
+; RV32-NEXT: mulhu a4, a3, a6
+; RV32-NEXT: sub a1, a1, a0
+; RV32-NEXT: add a2, a4, a2
+; RV32-NEXT: mul a1, a1, a6
+; RV32-NEXT: add a1, a2, a1
+; RV32-NEXT: mul a0, a3, a6
; RV32-NEXT: ret
;
; RV64-LABEL: test_udiv_9:
; RV64: # %bb.0:
-; RV64-NEXT: addi sp, sp, -16
-; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
-; RV64-NEXT: li a2, 9
-; RV64-NEXT: li a3, 0
-; RV64-NEXT: call __udivti3
-; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
-; RV64-NEXT: addi sp, sp, 16
+; RV64-NEXT: slli a2, a1, 4
+; RV64-NEXT: srli a3, a0, 60
+; RV64-NEXT: slli a4, a1, 34
+; RV64-NEXT: srli a5, a0, 30
+; RV64-NEXT: lui a6, 262144
+; RV64-NEXT: srli a7, a1, 26
+; RV64-NEXT: or a2, a3, a2
+; RV64-NEXT: srli a3, a1, 56
+; RV64-NEXT: or a4, a5, a4
+; RV64-NEXT: addi a6, a6, -1
+; RV64-NEXT: add a4, a0, a4
+; RV64-NEXT: and a5, a0, a6
+; RV64-NEXT: add a2, a4, a2
+; RV64-NEXT: and a4, a4, a6
+; RV64-NEXT: sltu a5, a4, a5
+; RV64-NEXT: add a5, a2, a5
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: sltu a2, a2, a4
+; RV64-NEXT: lui a4, %hi(.LCPI3_0)
+; RV64-NEXT: add a2, a2, a3
+; RV64-NEXT: lui a3, %hi(.LCPI3_1)
+; RV64-NEXT: add a7, a5, a7
+; RV64-NEXT: and a5, a5, a6
+; RV64-NEXT: add a2, a7, a2
+; RV64-NEXT: and a7, a7, a6
+; RV64-NEXT: sltu a5, a7, a5
+; RV64-NEXT: lui a7, %hi(.LCPI3_2)
+; RV64-NEXT: ld a4, %lo(.LCPI3_0)(a4)
+; RV64-NEXT: ld a3, %lo(.LCPI3_1)(a3)
+; RV64-NEXT: ld a7, %lo(.LCPI3_2)(a7)
+; RV64-NEXT: add a2, a2, a5
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: mulhu a4, a2, a4
+; RV64-NEXT: slli a5, a4, 3
+; RV64-NEXT: sub a2, a2, a4
+; RV64-NEXT: sub a2, a2, a5
+; RV64-NEXT: sub a4, a0, a2
+; RV64-NEXT: sltu a0, a0, a2
+; RV64-NEXT: mul a2, a4, a3
+; RV64-NEXT: mulhu a3, a4, a7
+; RV64-NEXT: sub a1, a1, a0
+; RV64-NEXT: add a2, a3, a2
+; RV64-NEXT: mul a1, a1, a7
+; RV64-NEXT: add a1, a2, a1
+; RV64-NEXT: mul a0, a4, a7
; RV64-NEXT: ret
%a = udiv iXLen2 %x, 9
ret iXLen2 %a
diff --git a/llvm/test/CodeGen/RISCV/split-urem-by-constant.ll b/llvm/test/CodeGen/RISCV/split-urem-by-constant.ll
index bc4a99a00ac64..1680ea7d8da30 100644
--- a/llvm/test/CodeGen/RISCV/split-urem-by-constant.ll
+++ b/llvm/test/CodeGen/RISCV/split-urem-by-constant.ll
@@ -79,24 +79,63 @@ define iXLen2 @test_urem_5(iXLen2 %x) nounwind {
define iXLen2 @test_urem_7(iXLen2 %x) nounwind {
; RV32-LABEL: test_urem_7:
; RV32: # %bb.0:
-; RV32-NEXT: addi sp, sp, -16
-; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32-NEXT: li a2, 7
-; RV32-NEXT: li a3, 0
-; RV32-NEXT: call __umoddi3
-; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32-NEXT: addi sp, sp, 16
+; RV32-NEXT: lui a2, 262144
+; RV32-NEXT: slli a3, a1, 2
+; RV32-NEXT: srli a4, a0, 30
+; RV32-NEXT: srli a1, a1, 28
+; RV32-NEXT: lui a5, 149797
+; RV32-NEXT: addi a2, a2, -1
+; RV32-NEXT: or a3, a4, a3
+; RV32-NEXT: addi a4, a5, -1755
+; RV32-NEXT: and a5, a0, a2
+; RV32-NEXT: add a0, a0, a3
+; RV32-NEXT: and a3, a0, a2
+; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: sltu a1, a3, a5
+; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: and a0, a0, a2
+; RV32-NEXT: mulhu a1, a0, a4
+; RV32-NEXT: slli a2, a1, 3
+; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: sub a0, a0, a2
+; RV32-NEXT: li a1, 0
; RV32-NEXT: ret
;
; RV64-LABEL: test_urem_7:
; RV64: # %bb.0:
-; RV64-NEXT: addi sp, sp, -16
-; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
-; RV64-NEXT: li a2, 7
-; RV64-NEXT: li a3, 0
-; RV64-NEXT: call __umodti3
-; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
-; RV64-NEXT: addi sp, sp, 16
+; RV64-NEXT: slli a2, a1, 4
+; RV64-NEXT: srli a3, a0, 60
+; RV64-NEXT: slli a4, a1, 34
+; RV64-NEXT: srli a5, a0, 30
+; RV64-NEXT: lui a6, 262144
+; RV64-NEXT: or a2, a3, a2
+; RV64-NEXT: srli a3, a1, 26
+; RV64-NEXT: srli a1, a1, 56
+; RV64-NEXT: or a4, a5, a4
+; RV64-NEXT: lui a5, %hi(.LCPI2_0)
+; RV64-NEXT: addi a6, a6, -1
+; RV64-NEXT: ld a5, %lo(.LCPI2_0)(a5)
+; RV64-NEXT: add a4, a0, a4
+; RV64-NEXT: and a0, a0, a6
+; RV64-NEXT: add a2, a4, a2
+; RV64-NEXT: and a4, a4, a6
+; RV64-NEXT: sltu a0, a4, a0
+; RV64-NEXT: add a0, a2, a0
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: sltu a2, a2, a4
+; RV64-NEXT: and a4, a0, a6
+; RV64-NEXT: add a0, a0, a3
+; RV64-NEXT: add a1, a2, a1
+; RV64-NEXT: and a2, a0, a6
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: sltu a1, a2, a4
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: and a0, a0, a6
+; RV64-NEXT: mulhu a1, a0, a5
+; RV64-NEXT: slli a2, a1, 3
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: sub a0, a0, a2
+; RV64-NEXT: li a1, 0
; RV64-NEXT: ret
%a = urem iXLen2 %x, 7
ret iXLen2 %a
@@ -105,24 +144,64 @@ define iXLen2 @test_urem_7(iXLen2 %x) nounwind {
define iXLen2 @test_urem_9(iXLen2 %x) nounwind {
; RV32-LABEL: test_urem_9:
; RV32: # %bb.0:
-; RV32-NEXT: addi sp, sp, -16
-; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
-; RV32-NEXT: li a2, 9
-; RV32-NEXT: li a3, 0
-; RV32-NEXT: call __umoddi3
-; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
-; RV32-NEXT: addi sp, sp, 16
+; RV32-NEXT: lui a2, 262144
+; RV32-NEXT: slli a3, a1, 2
+; RV32-NEXT: srli a4, a0, 30
+; RV32-NEXT: srli a1, a1, 28
+; RV32-NEXT: lui a5, 233017
+; RV32-NEXT: addi a2, a2, -1
+; RV32-NEXT: or a3, a4, a3
+; RV32-NEXT: addi a4, a5, -455
+; RV32-NEXT: and a5, a0, a2
+; RV32-NEXT: add a0, a0, a3
+; RV32-NEXT: and a3, a0, a2
+; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: sltu a1, a3, a5
+; RV32-NEXT: add a0, a0, a1
+; RV32-NEXT: and a0, a0, a2
+; RV32-NEXT: mulhu a1, a0, a4
+; RV32-NEXT: srli a1, a1, 1
+; RV32-NEXT: slli a2, a1, 3
+; RV32-NEXT: sub a0, a0, a1
+; RV32-NEXT: sub a0, a0, a2
+; RV32-NEXT: li a1, 0
; RV32-NEXT: ret
;
; RV64-LABEL: test_urem_9:
; RV64: # %bb.0:
-; RV64-NEXT: addi sp, sp, -16
-; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
-; RV64-NEXT: li a2, 9
-; RV64-NEXT: li a3, 0
-; RV64-NEXT: call __umodti3
-; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
-; RV64-NEXT: addi sp, sp, 16
+; RV64-NEXT: slli a2, a1, 4
+; RV64-NEXT: srli a3, a0, 60
+; RV64-NEXT: slli a4, a1, 34
+; RV64-NEXT: srli a5, a0, 30
+; RV64-NEXT: lui a6, 262144
+; RV64-NEXT: or a2, a3, a2
+; RV64-NEXT: srli a3, a1, 26
+; RV64-NEXT: srli a1, a1, 56
+; RV64-NEXT: or a4, a5, a4
+; RV64-NEXT: lui a5, %hi(.LCPI3_0)
+; RV64-NEXT: addi a6, a6, -1
+; RV64-NEXT: ld a5, %lo(.LCPI3_0)(a5)
+; RV64-NEXT: add a4, a0, a4
+; RV64-NEXT: and a0, a0, a6
+; RV64-NEXT: add a2, a4, a2
+; RV64-NEXT: and a4, a4, a6
+; RV64-NEXT: sltu a0, a4, a0
+; RV64-NEXT: add a0, a2, a0
+; RV64-NEXT: and a2, a2, a6
+; RV64-NEXT: sltu a2, a2, a4
+; RV64-NEXT: and a4, a0, a6
+; RV64-NEXT: add a0, a0, a3
+; RV64-NEXT: add a1, a2, a1
+; RV64-NEXT: and a2, a0, a6
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: sltu a1, a2, a4
+; RV64-NEXT: add a0, a0, a1
+; RV64-NEXT: and a0, a0, a6
+; RV64-NEXT: mulhu a1, a0, a5
+; RV64-NEXT: slli a2, a1, 3
+; RV64-NEXT: sub a0, a0, a1
+; RV64-NEXT: sub a0, a0, a2
+; RV64-NEXT: li a1, 0
; RV64-NEXT: ret
%a = urem iXLen2 %x, 9
ret iXLen2 %a
diff --git a/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll b/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
index 3ef9f3f945108..77a026669c51d 100644
--- a/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
+++ b/llvm/test/CodeGen/RISCV/urem-vector-lkk.ll
@@ -862,51 +862,61 @@ define <4 x i64> @dont_fold_urem_i64(<4 x i64> %x) nounwind {
; RV32IM-NEXT: sw s5, 20(sp) # 4-byte Folded Spill
; RV32IM-NEXT: sw s6, 16(sp) # 4-byte Folded Spill
; RV32IM-NEXT: sw s7, 12(sp) # 4-byte Folded Spill
-; RV32IM-NEXT: sw s8, 8(sp) # 4-byte Folded Spill
-; RV32IM-NEXT: lw s1, 16(a1)
-; RV32IM-NEXT: lw s2, 20(a1)
-; RV32IM-NEXT: lw s3, 24(a1)
-; RV32IM-NEXT: lw s4, 28(a1)
-; RV32IM-NEXT: lw a3, 0(a1)
-; RV32IM-NEXT: lw a4, 4(a1)
-; RV32IM-NEXT: lw s5, 8(a1)
-; RV32IM-NEXT: lw s6, 12(...
[truncated]
// We restrict to small chunk sizes (e.g., ≤ 32 bits) to ensure that all
// operations remain legal on most targets.
unsigned MaxChunk = 32;
Should choose a maximum size based on the set of legal types and operations instead of just guessing that 32 is good
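One possible shape for that suggestion, sketched as an assumption about the intent: isTypeLegal and isOperationLegalOrCustom are the usual TargetLowering queries, but the exact set of operations to require here is a guess, not something stated in the review.

// Hypothetical: derive the search bound from the widest legal integer type on
// which the emitted arithmetic (add, mulhu) is legal or custom, instead of
// hard-coding 32.
unsigned MaxChunk = 0;
for (MVT CandVT : {MVT::i64, MVT::i32, MVT::i16, MVT::i8}) {
  if (isTypeLegal(CandVT) && isOperationLegalOrCustom(ISD::ADD, CandVT) &&
      isOperationLegalOrCustom(ISD::MULHU, CandVT)) {
    MaxChunk = CandVT.getScalarSizeInBits();
    break;
  }
}
if (MaxChunk == 0)
  return false; // No suitable chunk type on this target.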
if (BestChunkWidth > 0) {
  EVT ChunkVT = EVT::getIntegerVT(*DAG.getContext(), BestChunkWidth);

  if (!LL)
Braces, but I'm not really sure why this is conditional in the first place
if (getBooleanContents(ChunkVT) ==
    TargetLoweringBase::ZeroOrOneBooleanContent)
  NewCarry = DAG.getZExtOrTrunc(NewCarry, dl, ChunkVT);
else
  NewCarry = DAG.getSelect(dl, ChunkVT, NewCarry,
                           DAG.getConstant(1, dl, ChunkVT),
                           DAG.getConstant(0, dl, ChunkVT));
You're doing the zext in either case, so just do the zext. It doesn't depend on the boolean contents
This pattern is repeated in multiple places in the type legalizer. I suspect the getZExtOrTrunc pattern gives better results, but we should confirm.
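If that simplification is confirmed, the quoted branch would reduce to something like the following sketch, derived from the lines quoted above rather than taken from the patch:

SDValue Add = DAG.getNode(ISD::ADD, dl, ChunkVT, Sum, Parts[i]);
SDValue NewCarry = DAG.getSetCC(dl, SetCCType, Add, Sum, ISD::SETULT);
// Per the review, materialize the carry with a plain zext/trunc and drop the
// getBooleanContents branch; whether this is always equivalent still needs to
// be confirmed.
NewCarry = DAG.getZExtOrTrunc(NewCarry, dl, ChunkVT);
Sum = DAG.getNode(ISD::ADD, dl, ChunkVT, Add, Carry);
Carry = NewCarry;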
if (!LL)
  std::tie(LL, LH) =
      DAG.SplitScalar(N->getOperand(0), dl, HiLoVT, HiLoVT);
SDValue In = DAG.getNode(ISD::BUILD_PAIR, dl, VT, LL, LH);
You only need the BUILD_PAIR if LL is set. Otherwise, you can use N->getOperand(0) as In.
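A sketch of that suggestion, reusing the names from the quoted hunk (illustrative only):

// Only rebuild the wide value when the caller supplied split halves; otherwise
// the original operand already is the full-width input.
SDValue In = LL ? DAG.getNode(ISD::BUILD_PAIR, dl, VT, LL, LH)
                : N->getOperand(0);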