
[LegalizeTypes] Expand 128-bit UDIV/UREM by constant via Chunk Addition #146238


Open: wants to merge 1 commit into main
76 changes: 74 additions & 2 deletions llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -7981,8 +7981,6 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,

// If (1 << HBitWidth) % divisor == 1, we can add the two halves together and
// then add in the carry.
// TODO: If we can't split it in half, we might be able to split into 3 or
// more pieces using a smaller bit width.
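// (Why this works: X == LH * 2^HBitWidth + LL and 2^HBitWidth % divisor
// == 1, so X % divisor == (LH + LL) % divisor; the end-around carry add
// keeps the sum congruent when LH + LL wraps.)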
if (HalfMaxPlus1.urem(Divisor).isOne()) {
assert(!LL == !LH && "Expected both input halves or no input halves!");
if (!LL)
@@ -8030,6 +8028,80 @@ bool TargetLowering::expandDIVREMByConstant(SDNode *N,
DAG.getConstant(0, dl, HiLoVT));
Sum = DAG.getNode(ISD::ADD, dl, HiLoVT, Sum, Carry);
}

} else {
// If we cannot split into two halves, look for a smaller chunk width
// where (1 << ChunkWidth) mod Divisor == 1. This ensures that the sum
// of all such chunks modulo Divisor is equivalent to the original
// value modulo Divisor.
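// For example, (1 << 30) % 7 == 1, so a 128-bit value is congruent
// mod 7 to the sum of its 30-bit chunks (the base-2^30 analogue of
// casting out nines in decimal).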
const APInt &Divisor = CN->getAPIntValue();
unsigned BitWidth = VT.getScalarSizeInBits();
unsigned BestChunkWidth = 0;

// We restrict to small chunk sizes (e.g., <= 32 bits) to ensure that all
// operations remain legal on most targets.
unsigned MaxChunk = 32;
Review comment (Contributor), on lines +8041 to +8043:
Should choose a maximum size based on the set of legal types and operations instead of just guessing that 32 is good.
for (int i = MaxChunk; i >= 1; --i) {
APInt ChunkMaxPlus1 = APInt::getOneBitSet(BitWidth, i);
if (ChunkMaxPlus1.urem(Divisor).isOne()) {
BestChunkWidth = i;
break;
}
}

// If we found a good chunk width, slice the number and sum the pieces.
if (BestChunkWidth > 0) {
EVT ChunkVT = EVT::getIntegerVT(*DAG.getContext(), BestChunkWidth);

if (!LL)
Review comment (Contributor):
Braces, but I'm not really sure why this is conditional in the first place.

std::tie(LL, LH) =
DAG.SplitScalar(N->getOperand(0), dl, HiLoVT, HiLoVT);
SDValue In = DAG.getNode(ISD::BUILD_PAIR, dl, VT, LL, LH);
Review comment (@topperc, Collaborator, Jun 30, 2025):
You only need the BUILD_PAIR if LL is set. Otherwise, you can use N->getOperand(0) as In.

SmallVector<SDValue, 8> Parts;
// Split into fixed-size chunks
for (unsigned i = 0; i < BitWidth; i += BestChunkWidth) {
SDValue Shift = DAG.getShiftAmountConstant(i, VT, dl);
SDValue Chunk = DAG.getNode(ISD::SRL, dl, VT, In, Shift);
Chunk = DAG.getNode(ISD::TRUNCATE, dl, ChunkVT, Chunk);
Parts.push_back(Chunk);
}
if (Parts.empty())
return false;
Sum = Parts[0];

// Use uaddo_carry if we can; otherwise use a compare to detect overflow,
// the same logic as in the half-split case above.
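// (An unsigned add wraps exactly when the result is smaller than either
// operand; e.g. in 8 bits, 200 + 100 == 44 and 44 < 200. That is what
// the SETULT compare below tests.)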
SDValue Carry = DAG.getConstant(0, dl, ChunkVT);
EVT SetCCType =
getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), ChunkVT);
for (unsigned i = 1; i < Parts.size(); ++i) {
if (isOperationLegalOrCustom(ISD::UADDO_CARRY, ChunkVT)) {
SDVTList VTList = DAG.getVTList(ChunkVT, SetCCType);
SDValue UAdd = DAG.getNode(ISD::UADDO, dl, VTList, Sum, Parts[i]);
Sum = DAG.getNode(ISD::UADDO_CARRY, dl, VTList, UAdd, Carry,
UAdd.getValue(1));
} else {
SDValue Add = DAG.getNode(ISD::ADD, dl, ChunkVT, Sum, Parts[i]);
SDValue NewCarry = DAG.getSetCC(dl, SetCCType, Add, Sum, ISD::SETULT);

if (getBooleanContents(ChunkVT) ==
TargetLoweringBase::ZeroOrOneBooleanContent)
NewCarry = DAG.getZExtOrTrunc(NewCarry, dl, ChunkVT);
else
NewCarry = DAG.getSelect(dl, ChunkVT, NewCarry,
DAG.getConstant(1, dl, ChunkVT),
DAG.getConstant(0, dl, ChunkVT));
Review comment (Contributor), on lines +8088 to +8094:
You're doing the zext in either case, so just do the zext. It doesn't depend on the boolean contents.

Review comment (@topperc, Collaborator, Jun 30, 2025):
This pattern is repeated in multiple places in the type legalizer. I suspect the getZExtOrTrunc pattern gives better results, but we should confirm.
Sum = DAG.getNode(ISD::ADD, dl, ChunkVT, Add, Carry);
Carry = NewCarry;
}
}

Sum = DAG.getNode(ISD::ZERO_EXTEND, dl, HiLoVT, Sum);
} else {
return false;
}
}

// If we didn't find a sum, we can't do the expansion.
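For intuition, the new code path generalizes "casting out nines": when (1 << ChunkWidth) % Divisor == 1, every chunk's positional weight is congruent to 1 mod Divisor, so the sum of the chunks has the same remainder as the original value. A minimal standalone sketch, not part of the patch, using the values the RISC-V tests below exercise (Divisor = 7, for which the search above settles on ChunkWidth = 30; the 0x3FFFFFFF mask is visible in the generated assembly as lui 262144 plus addi -1):

#include <cassert>
#include <cstdint>
#include <cstdio>

int main() {
  const unsigned K = 30;                  // chunk width: (1 << 30) % 7 == 1
  const uint64_t D = 7;
  const uint64_t Mask = (1ULL << K) - 1;  // 0x3FFFFFFF
  const uint64_t X = 0x123456789ABCDEF0ULL;

  // X = sum(Chunk_i * 2^(K*i)) and 2^(K*i) % D == 1 for every i,
  // so X % D == (Chunk_0 + Chunk_1 + ...) % D.
  uint64_t Sum = 0;
  for (uint64_t V = X; V != 0; V >>= K)
    Sum += V & Mask;

  assert(Sum % D == X % D);
  printf("X %% 7 = %llu, chunk-sum %% 7 = %llu\n",
         (unsigned long long)(X % D), (unsigned long long)(Sum % D));
  return 0;
}

The in-tree expansion performs the same reduction with target operations: it accumulates the chunks with add-with-carry (the end-around carry keeps each wraparound congruent mod Divisor), then divides the small remainder-preserving sum instead of the full 128-bit value.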
80 changes: 70 additions & 10 deletions llvm/test/CodeGen/RISCV/div-by-constant.ll
@@ -115,16 +115,76 @@ define i64 @udiv64_constant_no_add(i64 %a) nounwind {
}

define i64 @udiv64_constant_add(i64 %a) nounwind {
; RV32-LABEL: udiv64_constant_add:
; RV32: # %bb.0:
; RV32-NEXT: addi sp, sp, -16
; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
; RV32-NEXT: li a2, 7
; RV32-NEXT: li a3, 0
; RV32-NEXT: call __udivdi3
; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 16
; RV32-NEXT: ret
; RV32IM-LABEL: udiv64_constant_add:
; RV32IM: # %bb.0:
; RV32IM-NEXT: lui a2, 262144
; RV32IM-NEXT: slli a3, a1, 2
; RV32IM-NEXT: srli a4, a0, 30
; RV32IM-NEXT: srli a5, a1, 28
; RV32IM-NEXT: lui a6, 149797
; RV32IM-NEXT: addi a2, a2, -1
; RV32IM-NEXT: or a3, a4, a3
; RV32IM-NEXT: and a4, a0, a2
; RV32IM-NEXT: add a3, a0, a3
; RV32IM-NEXT: add a5, a3, a5
; RV32IM-NEXT: and a3, a3, a2
; RV32IM-NEXT: sltu a3, a3, a4
; RV32IM-NEXT: lui a4, 449390
; RV32IM-NEXT: add a3, a5, a3
; RV32IM-NEXT: lui a5, 748983
; RV32IM-NEXT: addi a6, a6, -1755
; RV32IM-NEXT: addi a4, a4, -1171
; RV32IM-NEXT: addi a5, a5, -585
; RV32IM-NEXT: and a2, a3, a2
; RV32IM-NEXT: mulhu a3, a2, a6
; RV32IM-NEXT: slli a6, a3, 3
; RV32IM-NEXT: add a2, a2, a3
; RV32IM-NEXT: sub a2, a2, a6
; RV32IM-NEXT: sub a3, a0, a2
; RV32IM-NEXT: sltu a0, a0, a2
; RV32IM-NEXT: mul a2, a3, a4
; RV32IM-NEXT: mulhu a4, a3, a5
; RV32IM-NEXT: sub a1, a1, a0
; RV32IM-NEXT: add a2, a4, a2
; RV32IM-NEXT: mul a1, a1, a5
; RV32IM-NEXT: add a1, a2, a1
; RV32IM-NEXT: mul a0, a3, a5
; RV32IM-NEXT: ret
;
; RV32IMZB-LABEL: udiv64_constant_add:
; RV32IMZB: # %bb.0:
; RV32IMZB-NEXT: srli a2, a0, 30
; RV32IMZB-NEXT: srli a3, a1, 28
; RV32IMZB-NEXT: lui a4, 786432
; RV32IMZB-NEXT: slli a5, a0, 2
; RV32IMZB-NEXT: lui a6, 149797
; RV32IMZB-NEXT: sh2add a2, a1, a2
; RV32IMZB-NEXT: srli a5, a5, 2
; RV32IMZB-NEXT: add a2, a0, a2
; RV32IMZB-NEXT: add a3, a2, a3
; RV32IMZB-NEXT: andn a2, a2, a4
; RV32IMZB-NEXT: sltu a2, a2, a5
; RV32IMZB-NEXT: lui a5, 449390
; RV32IMZB-NEXT: add a2, a3, a2
; RV32IMZB-NEXT: lui a3, 748983
; RV32IMZB-NEXT: addi a6, a6, -1755
; RV32IMZB-NEXT: addi a5, a5, -1171
; RV32IMZB-NEXT: addi a3, a3, -585
; RV32IMZB-NEXT: andn a2, a2, a4
; RV32IMZB-NEXT: mulhu a4, a2, a6
; RV32IMZB-NEXT: slli a6, a4, 3
; RV32IMZB-NEXT: add a2, a2, a4
; RV32IMZB-NEXT: sub a2, a2, a6
; RV32IMZB-NEXT: sub a4, a0, a2
; RV32IMZB-NEXT: sltu a0, a0, a2
; RV32IMZB-NEXT: mul a2, a4, a5
; RV32IMZB-NEXT: mulhu a5, a4, a3
; RV32IMZB-NEXT: sub a1, a1, a0
; RV32IMZB-NEXT: add a2, a5, a2
; RV32IMZB-NEXT: mul a1, a1, a3
; RV32IMZB-NEXT: add a1, a2, a1
; RV32IMZB-NEXT: mul a0, a4, a3
; RV32IMZB-NEXT: ret
;
; RV64-LABEL: udiv64_constant_add:
; RV64: # %bb.0:
183 changes: 155 additions & 28 deletions llvm/test/CodeGen/RISCV/split-udiv-by-constant.ll
@@ -117,24 +117,89 @@ define iXLen2 @test_udiv_5(iXLen2 %x) nounwind {
define iXLen2 @test_udiv_7(iXLen2 %x) nounwind {
; RV32-LABEL: test_udiv_7:
; RV32: # %bb.0:
; RV32-NEXT: addi sp, sp, -16
; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
; RV32-NEXT: li a2, 7
; RV32-NEXT: li a3, 0
; RV32-NEXT: call __udivdi3
; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 16
; RV32-NEXT: lui a2, 262144
; RV32-NEXT: slli a3, a1, 2
; RV32-NEXT: srli a4, a0, 30
; RV32-NEXT: srli a5, a1, 28
; RV32-NEXT: lui a6, 149797
; RV32-NEXT: addi a2, a2, -1
; RV32-NEXT: or a3, a4, a3
; RV32-NEXT: and a4, a0, a2
; RV32-NEXT: add a3, a0, a3
; RV32-NEXT: add a5, a3, a5
; RV32-NEXT: and a3, a3, a2
; RV32-NEXT: sltu a3, a3, a4
; RV32-NEXT: lui a4, 449390
; RV32-NEXT: add a3, a5, a3
; RV32-NEXT: lui a5, 748983
; RV32-NEXT: addi a6, a6, -1755
; RV32-NEXT: addi a4, a4, -1171
; RV32-NEXT: addi a5, a5, -585
; RV32-NEXT: and a2, a3, a2
; RV32-NEXT: mulhu a3, a2, a6
; RV32-NEXT: slli a6, a3, 3
; RV32-NEXT: add a2, a2, a3
; RV32-NEXT: sub a2, a2, a6
; RV32-NEXT: sub a3, a0, a2
; RV32-NEXT: sltu a0, a0, a2
; RV32-NEXT: mul a2, a3, a4
; RV32-NEXT: mulhu a4, a3, a5
; RV32-NEXT: sub a1, a1, a0
; RV32-NEXT: add a2, a4, a2
; RV32-NEXT: mul a1, a1, a5
; RV32-NEXT: add a1, a2, a1
; RV32-NEXT: mul a0, a3, a5
; RV32-NEXT: ret
;
; RV64-LABEL: test_udiv_7:
; RV64: # %bb.0:
; RV64-NEXT: addi sp, sp, -16
; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
; RV64-NEXT: li a2, 7
; RV64-NEXT: li a3, 0
; RV64-NEXT: call __udivti3
; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
; RV64-NEXT: addi sp, sp, 16
; RV64-NEXT: slli a2, a1, 4
; RV64-NEXT: srli a3, a0, 60
; RV64-NEXT: slli a4, a1, 34
; RV64-NEXT: srli a5, a0, 30
; RV64-NEXT: lui a6, 262144
; RV64-NEXT: srli a7, a1, 26
; RV64-NEXT: or a2, a3, a2
; RV64-NEXT: lui a3, 748983
; RV64-NEXT: or a4, a5, a4
; RV64-NEXT: addi a6, a6, -1
; RV64-NEXT: addi a3, a3, -585
; RV64-NEXT: add a4, a0, a4
; RV64-NEXT: slli a5, a3, 33
; RV64-NEXT: add a3, a3, a5
; RV64-NEXT: and a5, a0, a6
; RV64-NEXT: add a2, a4, a2
; RV64-NEXT: and a4, a4, a6
; RV64-NEXT: sltu a5, a4, a5
; RV64-NEXT: add a5, a2, a5
; RV64-NEXT: and a2, a2, a6
; RV64-NEXT: sltu a2, a2, a4
; RV64-NEXT: srli a4, a1, 56
; RV64-NEXT: add a2, a2, a4
; RV64-NEXT: lui a4, %hi(.LCPI2_0)
; RV64-NEXT: add a7, a5, a7
; RV64-NEXT: and a5, a5, a6
; RV64-NEXT: add a2, a7, a2
; RV64-NEXT: and a7, a7, a6
; RV64-NEXT: sltu a5, a7, a5
; RV64-NEXT: lui a7, %hi(.LCPI2_1)
; RV64-NEXT: ld a4, %lo(.LCPI2_0)(a4)
; RV64-NEXT: ld a7, %lo(.LCPI2_1)(a7)
; RV64-NEXT: add a2, a2, a5
; RV64-NEXT: and a2, a2, a6
; RV64-NEXT: mulhu a4, a2, a4
; RV64-NEXT: slli a5, a4, 3
; RV64-NEXT: add a2, a2, a4
; RV64-NEXT: sub a2, a2, a5
; RV64-NEXT: sub a4, a0, a2
; RV64-NEXT: sltu a0, a0, a2
; RV64-NEXT: mul a2, a4, a7
; RV64-NEXT: mulhu a5, a4, a3
; RV64-NEXT: sub a1, a1, a0
; RV64-NEXT: add a2, a5, a2
; RV64-NEXT: mul a1, a1, a3
; RV64-NEXT: add a1, a2, a1
; RV64-NEXT: mul a0, a4, a3
; RV64-NEXT: ret
%a = udiv iXLen2 %x, 7
ret iXLen2 %a
@@ -143,24 +208,86 @@ define iXLen2 @test_udiv_9(iXLen2 %x) nounwind {
define iXLen2 @test_udiv_9(iXLen2 %x) nounwind {
; RV32-LABEL: test_udiv_9:
; RV32: # %bb.0:
; RV32-NEXT: addi sp, sp, -16
; RV32-NEXT: sw ra, 12(sp) # 4-byte Folded Spill
; RV32-NEXT: li a2, 9
; RV32-NEXT: li a3, 0
; RV32-NEXT: call __udivdi3
; RV32-NEXT: lw ra, 12(sp) # 4-byte Folded Reload
; RV32-NEXT: addi sp, sp, 16
; RV32-NEXT: lui a2, 262144
; RV32-NEXT: slli a3, a1, 2
; RV32-NEXT: srli a4, a0, 30
; RV32-NEXT: srli a5, a1, 28
; RV32-NEXT: lui a6, 233017
; RV32-NEXT: addi a2, a2, -1
; RV32-NEXT: or a3, a4, a3
; RV32-NEXT: and a4, a0, a2
; RV32-NEXT: add a3, a0, a3
; RV32-NEXT: add a5, a3, a5
; RV32-NEXT: and a3, a3, a2
; RV32-NEXT: sltu a3, a3, a4
; RV32-NEXT: lui a4, 582542
; RV32-NEXT: addi a6, a6, -455
; RV32-NEXT: addi a4, a4, 910
; RV32-NEXT: add a3, a5, a3
; RV32-NEXT: and a2, a3, a2
; RV32-NEXT: mulhu a3, a2, a6
; RV32-NEXT: srli a3, a3, 1
; RV32-NEXT: slli a5, a3, 3
; RV32-NEXT: sub a2, a2, a3
; RV32-NEXT: sub a2, a2, a5
; RV32-NEXT: sub a3, a0, a2
; RV32-NEXT: sltu a0, a0, a2
; RV32-NEXT: mul a2, a3, a4
; RV32-NEXT: mulhu a4, a3, a6
; RV32-NEXT: sub a1, a1, a0
; RV32-NEXT: add a2, a4, a2
; RV32-NEXT: mul a1, a1, a6
; RV32-NEXT: add a1, a2, a1
; RV32-NEXT: mul a0, a3, a6
; RV32-NEXT: ret
;
; RV64-LABEL: test_udiv_9:
; RV64: # %bb.0:
; RV64-NEXT: addi sp, sp, -16
; RV64-NEXT: sd ra, 8(sp) # 8-byte Folded Spill
; RV64-NEXT: li a2, 9
; RV64-NEXT: li a3, 0
; RV64-NEXT: call __udivti3
; RV64-NEXT: ld ra, 8(sp) # 8-byte Folded Reload
; RV64-NEXT: addi sp, sp, 16
; RV64-NEXT: slli a2, a1, 4
; RV64-NEXT: srli a3, a0, 60
; RV64-NEXT: slli a4, a1, 34
; RV64-NEXT: srli a5, a0, 30
; RV64-NEXT: lui a6, 262144
; RV64-NEXT: srli a7, a1, 26
; RV64-NEXT: or a2, a3, a2
; RV64-NEXT: srli a3, a1, 56
; RV64-NEXT: or a4, a5, a4
; RV64-NEXT: addi a6, a6, -1
; RV64-NEXT: add a4, a0, a4
; RV64-NEXT: and a5, a0, a6
; RV64-NEXT: add a2, a4, a2
; RV64-NEXT: and a4, a4, a6
; RV64-NEXT: sltu a5, a4, a5
; RV64-NEXT: add a5, a2, a5
; RV64-NEXT: and a2, a2, a6
; RV64-NEXT: sltu a2, a2, a4
; RV64-NEXT: lui a4, %hi(.LCPI3_0)
; RV64-NEXT: add a2, a2, a3
; RV64-NEXT: lui a3, %hi(.LCPI3_1)
; RV64-NEXT: add a7, a5, a7
; RV64-NEXT: and a5, a5, a6
; RV64-NEXT: add a2, a7, a2
; RV64-NEXT: and a7, a7, a6
; RV64-NEXT: sltu a5, a7, a5
; RV64-NEXT: lui a7, %hi(.LCPI3_2)
; RV64-NEXT: ld a4, %lo(.LCPI3_0)(a4)
; RV64-NEXT: ld a3, %lo(.LCPI3_1)(a3)
; RV64-NEXT: ld a7, %lo(.LCPI3_2)(a7)
; RV64-NEXT: add a2, a2, a5
; RV64-NEXT: and a2, a2, a6
; RV64-NEXT: mulhu a4, a2, a4
; RV64-NEXT: slli a5, a4, 3
; RV64-NEXT: sub a2, a2, a4
; RV64-NEXT: sub a2, a2, a5
; RV64-NEXT: sub a4, a0, a2
; RV64-NEXT: sltu a0, a0, a2
; RV64-NEXT: mul a2, a4, a3
; RV64-NEXT: mulhu a3, a4, a7
; RV64-NEXT: sub a1, a1, a0
; RV64-NEXT: add a2, a3, a2
; RV64-NEXT: mul a1, a1, a7
; RV64-NEXT: add a1, a2, a1
; RV64-NEXT: mul a0, a4, a7
; RV64-NEXT: ret
%a = udiv iXLen2 %x, 9
ret iXLen2 %a
Expand Down