[WIP][DAG] Introduce generic shl_add node [NFC] #88791
Conversation
This patch introduces a generic SHL_ADD node in SelectionDAG. This node has the semantics of (add (shl A, ConstantB), C) but performed as a single instruction. This corresponds to instructions on multiple architectures:

* LEA on X86
* sh1add, sh2add, and sh3add on RISC-V
* adds on AArch64

The initial use case is during expansion of scalar multiplies. The X86 backend has long had the MUL_IMM node to prevent combine loops. As actually used, MUL_IMM only supported the immediates 3, 5, and 9. Not coincidentally, these are the multiplies which can be done via an LEA.

I'm in the process of extending our scalar multiplication code on RISC-V, and am going to need a similar construct. Rather than reinventing the wheel, it seems better to introduce a common representation.

In the current form, I am not proposing legalization or any generic combines to form shl_add. In the current patch structure, shl_add is only created post legalization and always corresponds to a legal instruction. (This is what x86 had before.) Note that the range of valid constant shift amounts differs on each architecture.

I am hoping that I'll be able to restructure some of the x86 code into generic combine infrastructure to share mul expansion, but that is an unproven theory at the moment. Longer term, we may also be able to form shl_add earlier when legal, but that has the challenge of potentially interfering with other desirable transforms.

In the current patch, I am not proposing any flags for shl_add as the current use case doesn't require them, but in principle, I see nothing which prevents us from later supporting nuw, nsw, and disjoint with the obvious meanings.
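To make the semantics concrete, here is a minimal sketch (not code from the patch) of building the node inside a DAG combine, assuming DAG, DL, VT, and operands A and C are already in scope; the MVT::i8 shift-amount type follows the x86 usage in the diff below:

```cpp
// SHL_ADD takes operands (a, b, c) and computes (add (shl a, b), c);
// the shift amount 'b' must be a constant.
SDValue ShAmt = DAG.getConstant(3, DL, MVT::i8);
SDValue ShlAdd = DAG.getNode(ISD::SHL_ADD, DL, VT, A, ShAmt, C);

// The equivalent two-node form this single node replaces:
SDValue Shl = DAG.getNode(ISD::SHL, DL, VT, A, ShAmt);
SDValue Add = DAG.getNode(ISD::ADD, DL, VT, Shl, C);
```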
@llvm/pr-subscribers-backend-risc-v @llvm/pr-subscribers-backend-x86

Author: Philip Reames (preames)

Full diff: https://github.com/llvm/llvm-project/pull/88791.diff

10 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/ISDOpcodes.h b/llvm/include/llvm/CodeGen/ISDOpcodes.h
index 49d51a27e3c0f6..a45e5b26a58198 100644
--- a/llvm/include/llvm/CodeGen/ISDOpcodes.h
+++ b/llvm/include/llvm/CodeGen/ISDOpcodes.h
@@ -710,6 +710,13 @@ enum NodeType {
FSHL,
FSHR,
+ // Represents (ADD (SHL a, b), c) with the arguments appearing in the order
+ // a, b, c. 'b' must be a constant, and follows the rules for shift amount
+ // types described just above. This is used soley post-legalization when
+ // lowering MUL to target specific instructions - e.g. LEA on x86 or
+ // sh1add/sh2add/sh3add on RISCV.
+ SHL_ADD,
+
/// Byte Swap and Counting operators.
BSWAP,
CTTZ,
diff --git a/llvm/include/llvm/Target/TargetSelectionDAG.td b/llvm/include/llvm/Target/TargetSelectionDAG.td
index ea3520835fa07d..aeef292d34b85b 100644
--- a/llvm/include/llvm/Target/TargetSelectionDAG.td
+++ b/llvm/include/llvm/Target/TargetSelectionDAG.td
@@ -121,6 +121,10 @@ def SDTIntShiftOp : SDTypeProfile<1, 2, [ // shl, sra, srl
def SDTIntShiftDOp: SDTypeProfile<1, 3, [ // fshl, fshr
SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisInt<0>, SDTCisInt<3>
]>;
+def SDTIntShiftAddOp : SDTypeProfile<1, 3, [ // shl_add
+ SDTCisSameAs<0, 1>, SDTCisSameAs<0, 3>, SDTCisInt<0>, SDTCisInt<2>,
+ SDTCisInt<3>
+]>;
def SDTIntSatNoShOp : SDTypeProfile<1, 2, [ // ssat with no shift
SDTCisSameAs<0, 1>, SDTCisInt<2>
]>;
@@ -411,6 +415,7 @@ def rotl : SDNode<"ISD::ROTL" , SDTIntShiftOp>;
def rotr : SDNode<"ISD::ROTR" , SDTIntShiftOp>;
def fshl : SDNode<"ISD::FSHL" , SDTIntShiftDOp>;
def fshr : SDNode<"ISD::FSHR" , SDTIntShiftDOp>;
+def shl_add : SDNode<"ISD::SHL_ADD" , SDTIntShiftAddOp>;
def and : SDNode<"ISD::AND" , SDTIntBinOp,
[SDNPCommutative, SDNPAssociative]>;
def or : SDNode<"ISD::OR" , SDTIntBinOp,
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
index ca0a95750ba8d8..59385c8bc27925 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
@@ -3521,6 +3521,13 @@ KnownBits SelectionDAG::computeKnownBits(SDValue Op, const APInt &DemandedElts,
Known = KnownBits::ashr(Known, Known2, /*ShAmtNonZero=*/false,
Op->getFlags().hasExact());
break;
+ case ISD::SHL_ADD:
+ Known = computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+ Known2 = computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+ Known = KnownBits::computeForAddSub(true, false, false,
+ KnownBits::shl(Known, Known2),
+ computeKnownBits(Op.getOperand(2), DemandedElts, Depth + 1));
+ break;
case ISD::FSHL:
case ISD::FSHR:
if (ConstantSDNode *C = isConstOrConstSplat(Op.getOperand(2), DemandedElts)) {
@@ -7346,6 +7353,11 @@ SDValue SelectionDAG::getNode(unsigned Opcode, const SDLoc &DL, EVT VT,
if (N1.getValueType() == VT)
return N1;
break;
+ case ISD::SHL_ADD:
+ assert(VT == N1.getValueType() && VT == N3.getValueType());
+ assert(TLI->isTypeLegal(VT) && "Created only post legalize");
+ assert(isa<ConstantSDNode>(N2) && "Constant shift expected");
+ break;
}
// Memoize node if it doesn't produce a glue result.
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
index 6691aa41face39..cc9dafcfa0c728 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
@@ -264,6 +264,7 @@ std::string SDNode::getOperationName(const SelectionDAG *G) const {
case ISD::SRL: return "srl";
case ISD::ROTL: return "rotl";
case ISD::ROTR: return "rotr";
+ case ISD::SHL_ADD: return "shl_add";
case ISD::FSHL: return "fshl";
case ISD::FSHR: return "fshr";
case ISD::FADD: return "fadd";
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 27387595164a46..cc64ccbedee92c 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12789,10 +12789,9 @@ static SDValue transformAddShlImm(SDNode *N, SelectionDAG &DAG,
SDLoc DL(N);
SDValue NS = (C0 < C1) ? N0->getOperand(0) : N1->getOperand(0);
SDValue NL = (C0 > C1) ? N0->getOperand(0) : N1->getOperand(0);
- SDValue NA0 =
- DAG.getNode(ISD::SHL, DL, VT, NL, DAG.getConstant(Diff, DL, VT));
- SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
- return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
+ SDValue SHADD =
+ DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS);
+ return DAG.getNode(ISD::SHL, DL, VT, SHADD, DAG.getConstant(Bits, DL, VT));
}
// Combine a constant select operand into its use:
@@ -13028,14 +13027,17 @@ static SDValue combineAddOfBooleanXor(SDNode *N, SelectionDAG &DAG) {
N0.getOperand(0));
}
-static SDValue performADDCombine(SDNode *N, SelectionDAG &DAG,
+static SDValue performADDCombine(SDNode *N,
+ TargetLowering::DAGCombinerInfo &DCI,
const RISCVSubtarget &Subtarget) {
+ SelectionDAG &DAG = DCI.DAG;
if (SDValue V = combineAddOfBooleanXor(N, DAG))
return V;
if (SDValue V = transformAddImmMulImm(N, DAG, Subtarget))
return V;
- if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
- return V;
+ if (!DCI.isBeforeLegalize())
+ if (SDValue V = transformAddShlImm(N, DAG, Subtarget))
+ return V;
if (SDValue V = combineBinOpToReduce(N, DAG, Subtarget))
return V;
if (SDValue V = combineBinOpOfExtractToReduceTree(N, DAG, Subtarget))
@@ -15894,7 +15896,7 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
return V;
if (SDValue V = combineToVWMACC(N, DAG, Subtarget))
return V;
- return performADDCombine(N, DAG, Subtarget);
+ return performADDCombine(N, DCI, Subtarget);
}
case ISD::SUB: {
if (SDValue V = combineBinOp_VLToVWBinOp_VL(N, DCI, Subtarget))
diff --git a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
index 434b071e628a0e..8837c66d603779 100644
--- a/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
+++ b/llvm/lib/Target/RISCV/RISCVInstrInfoZb.td
@@ -678,6 +678,8 @@ foreach i = {1,2,3} in {
defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
def : Pat<(XLenVT (add_like_non_imm12 (shl GPR:$rs1, (XLenVT i)), GPR:$rs2)),
(shxadd GPR:$rs1, GPR:$rs2)>;
+ def : Pat<(XLenVT (shl_add GPR:$rs1, (XLenVT i), GPR:$rs2)),
+ (shxadd GPR:$rs1, GPR:$rs2)>;
defvar pat = !cast<ComplexPattern>("sh"#i#"add_op");
// More complex cases use a ComplexPattern.
@@ -881,6 +883,9 @@ foreach i = {1,2,3} in {
defvar shxadd = !cast<Instruction>("SH"#i#"ADD");
def : Pat<(i32 (add_like_non_imm12 (shl GPR:$rs1, (i64 i)), GPR:$rs2)),
(shxadd GPR:$rs1, GPR:$rs2)>;
+ def : Pat<(i32 (shl_add GPR:$rs1, (i32 i), GPR:$rs2)),
+ (shxadd GPR:$rs1, GPR:$rs2)>;
+
}
}
diff --git a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
index 4e4241efd63d6b..50e271bde041ad 100644
--- a/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
+++ b/llvm/lib/Target/X86/X86ISelDAGToDAG.cpp
@@ -2519,7 +2519,6 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
if (N.getResNo() != 0) break;
[[fallthrough]];
case ISD::MUL:
- case X86ISD::MUL_IMM:
// X*[3,5,9] -> X+X*[2,4,8]
if (AM.BaseType == X86ISelAddressMode::RegBase &&
AM.Base_Reg.getNode() == nullptr &&
@@ -2551,7 +2550,45 @@ bool X86DAGToDAGISel::matchAddressRecursively(SDValue N, X86ISelAddressMode &AM,
}
}
break;
-
+ case ISD::SHL_ADD: {
+ // X << [1,2,3] + Y (we should never create anything else)
+ auto *CN = cast<ConstantSDNode>(N.getOperand(1));
+ assert(CN->getZExtValue() == 1 || CN->getZExtValue() == 2 ||
+ CN->getZExtValue() == 3);
+ if (AM.BaseType == X86ISelAddressMode::RegBase &&
+ AM.Base_Reg.getNode() == nullptr &&
+ AM.IndexReg.getNode() == nullptr) {
+ AM.Scale = unsigned(2 << (CN->getZExtValue() - 1));
+
+ if (N.getOperand(0) == N.getOperand(2)) {
+ SDValue MulVal = N.getOperand(0);
+ SDValue Reg;
+
+ // Okay, we know that we have a scale by now. However, if the scaled
+ // value is an add of something and a constant, we can fold the
+ // constant into the disp field here.
+ if (MulVal.getNode()->getOpcode() == ISD::ADD &&
+ N->isOnlyUserOf(MulVal.getNode()) &&
+ isa<ConstantSDNode>(MulVal.getOperand(1))) {
+ Reg = MulVal.getOperand(0);
+ auto *AddVal = cast<ConstantSDNode>(MulVal.getOperand(1));
+ uint64_t Disp = AddVal->getSExtValue() * (AM.Scale + 1);
+ if (foldOffsetIntoAddress(Disp, AM))
+ Reg = N.getOperand(0);
+ } else {
+ Reg = N.getOperand(0);
+ }
+ AM.IndexReg = AM.Base_Reg = Reg;
+ return false;
+ }
+ // TODO: If N.getOperand(2) is a constant, we could try folding
+ // the displacement analogously to the above.
+ AM.IndexReg = N.getOperand(0);
+ AM.Base_Reg = N.getOperand(2);
+ return false;
+ }
+ break;
+ }
case ISD::SUB: {
// Given A-B, if A can be completely folded into the address and
// the index field with the index field unused, use -B as the index.
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index f16a751a166d69..477c368c22e8ea 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -33553,7 +33553,6 @@ const char *X86TargetLowering::getTargetNodeName(unsigned Opcode) const {
NODE_NAME_CASE(BZHI)
NODE_NAME_CASE(PDEP)
NODE_NAME_CASE(PEXT)
- NODE_NAME_CASE(MUL_IMM)
NODE_NAME_CASE(MOVMSK)
NODE_NAME_CASE(PTEST)
NODE_NAME_CASE(TESTP)
@@ -36845,13 +36844,6 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
Known.resetAll();
switch (Opc) {
default: break;
- case X86ISD::MUL_IMM: {
- KnownBits Known2;
- Known = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
- Known2 = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
- Known = KnownBits::mul(Known, Known2);
- break;
- }
case X86ISD::SETCC:
Known.Zero.setBitsFrom(1);
break;
@@ -46905,12 +46897,18 @@ static SDValue reduceVMULWidth(SDNode *N, SelectionDAG &DAG,
return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResLo, ResHi);
}
+static SDValue createMulImm(uint64_t MulAmt, SDValue N, SelectionDAG &DAG,
+ EVT VT, const SDLoc &DL) {
+ assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
+ SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt-1), DL, MVT::i8);
+ return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N);
+}
+
static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
EVT VT, const SDLoc &DL) {
auto combineMulShlAddOrSub = [&](int Mult, int Shift, bool isAdd) {
- SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
- DAG.getConstant(Mult, DL, VT));
+ SDValue Result = createMulImm(Mult, N->getOperand(0), DAG, VT, DL);
Result = DAG.getNode(ISD::SHL, DL, VT, Result,
DAG.getConstant(Shift, DL, MVT::i8));
Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
@@ -46919,10 +46917,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
};
auto combineMulMulAddOrSub = [&](int Mul1, int Mul2, bool isAdd) {
- SDValue Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
- DAG.getConstant(Mul1, DL, VT));
- Result = DAG.getNode(X86ISD::MUL_IMM, DL, VT, Result,
- DAG.getConstant(Mul2, DL, VT));
+ SDValue Result = createMulImm(Mul1, N->getOperand(0), DAG, VT, DL);
+ Result = createMulImm(Mul2, Result, DAG, VT, DL);
Result = DAG.getNode(isAdd ? ISD::ADD : ISD::SUB, DL, VT, Result,
N->getOperand(0));
return Result;
@@ -46982,9 +46978,8 @@ static SDValue combineMulSpecial(uint64_t MulAmt, SDNode *N, SelectionDAG &DAG,
unsigned ShiftAmt = Log2_64((MulAmt & (MulAmt - 1)));
SDValue Shift1 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(ShiftAmt, DL, MVT::i8));
- SDValue Shift2 = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
- DAG.getConstant(ScaleShift, DL, MVT::i8));
- return DAG.getNode(ISD::ADD, DL, VT, Shift1, Shift2);
+ return DAG.getNode(ISD::SHL_ADD, DL, VT, N->getOperand(0),
+ DAG.getConstant(ScaleShift, DL, MVT::i8), Shift1);
}
}
@@ -47204,8 +47199,7 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
SDValue NewMul = SDValue();
if (VT == MVT::i64 || VT == MVT::i32) {
if (AbsMulAmt == 3 || AbsMulAmt == 5 || AbsMulAmt == 9) {
- NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
- DAG.getConstant(AbsMulAmt, DL, VT));
+ NewMul = createMulImm(AbsMulAmt, N->getOperand(0), DAG, VT, DL);
if (SignMulAmt < 0)
NewMul =
DAG.getNode(ISD::SUB, DL, VT, DAG.getConstant(0, DL, VT), NewMul);
@@ -47243,15 +47237,13 @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
NewMul = DAG.getNode(ISD::SHL, DL, VT, N->getOperand(0),
DAG.getConstant(Log2_64(MulAmt1), DL, MVT::i8));
else
- NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, N->getOperand(0),
- DAG.getConstant(MulAmt1, DL, VT));
+ NewMul = createMulImm(MulAmt1, N->getOperand(0), DAG, VT, DL);
if (isPowerOf2_64(MulAmt2))
NewMul = DAG.getNode(ISD::SHL, DL, VT, NewMul,
DAG.getConstant(Log2_64(MulAmt2), DL, MVT::i8));
else
- NewMul = DAG.getNode(X86ISD::MUL_IMM, DL, VT, NewMul,
- DAG.getConstant(MulAmt2, DL, VT));
+ NewMul = createMulImm(MulAmt2, NewMul, DAG, VT, DL);
// Negate the result.
if (SignMulAmt < 0)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index 0a1e8ca4427314..7c5bfac3308c8e 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -417,9 +417,6 @@ namespace llvm {
PDEP,
PEXT,
- // X86-specific multiply by immediate.
- MUL_IMM,
-
// Vector sign bit extraction.
MOVMSK,
diff --git a/llvm/lib/Target/X86/X86InstrFragments.td b/llvm/lib/Target/X86/X86InstrFragments.td
index f14c7200af968a..faeeccab7dac72 100644
--- a/llvm/lib/Target/X86/X86InstrFragments.td
+++ b/llvm/lib/Target/X86/X86InstrFragments.td
@@ -284,8 +284,6 @@ def X86bzhi : SDNode<"X86ISD::BZHI", SDTIntBinOp>;
def X86pdep : SDNode<"X86ISD::PDEP", SDTIntBinOp>;
def X86pext : SDNode<"X86ISD::PEXT", SDTIntBinOp>;
-def X86mul_imm : SDNode<"X86ISD::MUL_IMM", SDTIntBinOp>;
-
def X86DynAlloca : SDNode<"X86ISD::DYN_ALLOCA", SDT_X86DYN_ALLOCA,
[SDNPHasChain, SDNPOutGlue]>;
@@ -341,11 +339,11 @@ def X86cmpccxadd : SDNode<"X86ISD::CMPCCXADD", SDTX86Cmpccxadd,
// Define X86-specific addressing mode.
def addr : ComplexPattern<iPTR, 5, "selectAddr", [], [SDNPWantParent]>;
def lea32addr : ComplexPattern<i32, 5, "selectLEAAddr",
- [add, sub, mul, X86mul_imm, shl, or, xor, frameindex],
+ [add, sub, mul, shl_add, shl, or, xor, frameindex],
[]>;
// In 64-bit mode 32-bit LEAs can use RIP-relative addressing.
def lea64_32addr : ComplexPattern<i32, 5, "selectLEA64_32Addr",
- [add, sub, mul, X86mul_imm, shl, or, xor,
+ [add, sub, mul, shl_add, shl, or, xor,
frameindex, X86WrapperRIP],
[]>;
@@ -356,7 +354,7 @@ def tls32baseaddr : ComplexPattern<i32, 5, "selectTLSADDRAddr",
[tglobaltlsaddr], []>;
def lea64addr : ComplexPattern<i64, 5, "selectLEAAddr",
- [add, sub, mul, X86mul_imm, shl, or, xor, frameindex,
+ [add, sub, mul, shl_add, shl, or, xor, frameindex,
X86WrapperRIP], []>;
def tls64addr : ComplexPattern<i64, 5, "selectTLSADDRAddr",
You can test this locally with the following command:

git-clang-format --diff b7b183371b54a2c4b5d2a39c594d3967a7384cb8 33fd866e5fe4711a3dcd906b1f0d3449bf5e7d09 -- llvm/include/llvm/CodeGen/ISDOpcodes.h llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/X86/X86ISelDAGToDAG.cpp llvm/lib/Target/X86/X86ISelLowering.cpp llvm/lib/Target/X86/X86ISelLowering.h

View the diff from clang-format here:

diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 1748b372e2..749d9f7378 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -36843,7 +36843,8 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
Known.resetAll();
switch (Opc) {
- default: break;
+ default:
+ break;
case X86ISD::SETCC:
Known.Zero.setBitsFrom(1);
break;
AMDGPU has these too, sometimes
Ping
No objections from me - needing MUL_IMM at all has always annoyed me; it's often needed to stop extra combines that would prevent us matching for LEA etc., but that often leads to more complex code than necessary.
There's the possibility we could use SHL_ADD for vector gather/scatter address math (vector index + offset, imm scalar shift amt), but I don't see anything here that would prevent that in the future?
Note - there are a few real clang-format warnings that need cleaning up.
This implements a RISC-V specific version of the SHL_ADD node proposed in llvm#88791. If that lands, the infrastructure from this patch should seamlessly switch over to the generic DAG node.

I'm posting this separately because I've run out of useful multiply strength reduction work to do without having a way to represent MUL X, 3/5/9 as a single instruction.

The majority of this change is moving two sets of patterns out of tablegen and into the post-legalize combine. The major reason for this is that I have an upcoming change which needs to reuse the expansion logic, but it also helps common up some code between zba and the THeadBa variants.

On the test changes, there are a couple of major categories:
* We chose a different lowering for mul x, 25. The new lowering involves one fewer register and the same critical path, so this seems like a win.
* The order of the two multiplies changes in (3,5,9)*(3,5,9) in some cases. I don't believe this matters.
* I'm removing the one use restriction on the multiply. This restriction doesn't really make sense to me, and the test changes appear positive.
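For reference, the MUL X, 3/5/9 representation this relies on is just a shift-and-add identity; a hedged sketch (the helper name shiftFor is hypothetical, not from the patch):

```cpp
#include <cassert>
#include <cstdint>
#include "llvm/Support/MathExtras.h"

// MulAmt in {3, 5, 9} decomposes as (X << Log2(MulAmt - 1)) + X:
//   3 * X == (X << 1) + X   // sh1add X, X  (or LEA base + index*2)
//   5 * X == (X << 2) + X   // sh2add X, X  (or LEA base + index*4)
//   9 * X == (X << 3) + X   // sh3add X, X  (or LEA base + index*8)
static uint64_t shiftFor(uint64_t MulAmt) {
  assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
  return llvm::Log2_64(MulAmt - 1);
}
```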
Real ones fixed; the remaining failure appears to be a misconfiguration in the bot.
// a, b, c. 'b' must be a constant, and follows the rules for shift amount
// types described just above. This is used soley post-legalization when
// lowering MUL to target specific instructions - e.g. LEA on x86 or
// sh1add/sh2add/sh3add on RISCV. |
RISCV -> RISC-V
SDValue NA1 = DAG.getNode(ISD::ADD, DL, VT, NA0, NS);
return DAG.getNode(ISD::SHL, DL, VT, NA1, DAG.getConstant(Bits, DL, VT));
SDValue SHADD =
DAG.getNode(ISD::SHL_ADD, DL, VT, NL, DAG.getConstant(Diff, DL, VT), NS); |
Should we use TargetConstant if it's required to be a constant?
Fixed over in (#89263).
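For reference, the suggestion amounts to a one-line change against the quoted code, building the shift amount with getTargetConstant so generic combines treat it as opaque; a sketch:

```cpp
// Same node, but with the shift amount as a TargetConstant so generic
// DAG combines will not rewrite it independently of the SHL_ADD.
SDValue SHADD = DAG.getNode(ISD::SHL_ADD, DL, VT, NL,
                            DAG.getTargetConstant(Diff, DL, VT), NS);
```

(As noted in the reland of #89263 below, this variant was later reverted because it kept the uimm2 clause in the XTheadBa patterns from matching.)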
EVT VT, const SDLoc &DL) {
assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt - 1), DL, MVT::i8);
return DAG.getNode(ISD::SHL_ADD, DL, VT, N, ShAmt, N); |
Is this going to cause an issue with poison? We've now increased the use count of N.
That... is a good question. We probably need to freeze here since we're increasing the number of uses; I had not considered that. Let me add the freeze and see if that influences codegen in practice. If it does, we may need to consider both a SHL_ADD node and a MUL359 node. I'm hoping we don't; let me investigate and report back.
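For concreteness, a sketch of the freeze being discussed, applied to the createMulImm helper from this patch (a sketch of the proposed fix, not the final code):

```cpp
static SDValue createMulImm(uint64_t MulAmt, SDValue N, SelectionDAG &DAG,
                            EVT VT, const SDLoc &DL) {
  assert(MulAmt == 3 || MulAmt == 5 || MulAmt == 9);
  // N gains a second use below; freeze it once so a poison input cannot
  // be observed as two different values by the shifted and added operands.
  SDValue FrozenN = DAG.getFreeze(N);
  SDValue ShAmt = DAG.getConstant(Log2_64(MulAmt - 1), DL, MVT::i8);
  return DAG.getNode(ISD::SHL_ADD, DL, VT, FrozenN, ShAmt, FrozenN);
}
```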
Well, the news here is not good. Adding in Freeze in the x86 backend code causes a whole bunch of regressions that were not obvious at first glance. Interestingly, incorporating the same logic into the RISC-V specific version of this patch (#89263) doesn't seem to expose the same kind of problems - most likely because the usage is much more isolated. #89290 fixes an analogous freeze issue in code already landed, again with no visible code diff.
I think what I'd like to suggest here is that we go ahead and focus review on #89263. Once we land that, I can iterate in tree on the RISC-V specific parts, and then rebase this patch on a fully fleshed-out implementation and focus it on the x86 merge. (I clearly need to track something down there.)
(For the record, the issue @dtcxzyw flagged in the RISCV specific part of this patch doesn't exist in #89263 as I focused on a different subset there. That's probably confusing for reviewers in retrospect, sorry!)
I investigated these differences further. Net result is one fairly obvious missed optimization, one somewhat complicated but reasonable issue with COPY elimination, and one fundamental issue. I'm going to focus on only the last.
We end up with a situation where an inserted freeze gets hoisted through a chain of computation. This is all correct and fine, but as a side effect of that hoisting, we strip nsw off an add. The net result is that we can't prove a narrow addressing sequence is equivalent to the wider form, and thus fail to be able to fold a constant base offset into the addressing mode.
I'm a bit stuck on what to do about this case, and need to give this more thought.
This patch is not an NFC. Example (before/after codegen at the link): https://godbolt.org/z/PMTqjf45M
Looks like maybe I missed a pattern for the .uw variant, will go track this down. Can I ask how you're finding these? This is the second time you've chimed in with a really useful counter example on these threads. Are you fuzzing the changes? Much appreciated.
topperc pointed this out in review of llvm#88791, but I believe the problem applies here as well. Worth noting is that the code I introduced with this bug was mostly copied from other targets - which also have this bug.
These cases are sampled from my opt-benchmark. It is built from some popular real-world libraries/applications. BTW, it has been used to assist in the review of InstCombine-related patches.
I don't see any value in the fuzzer-generated cases (at least in terms of performance). See also https://llvm.org/docs/InstCombineContributorGuide.html#real-world-usefulness.
I've confirmed this was a missing pattern, and have a local patch with a fix based on the RISCVISD::SHL_ADD change. I am not going to post it until the RISCVISD::SHL_ADD patch lands to avoid confusing us all.
…ombine (#89263)" Changes since original commit: * Rebase over improved test coverage for theadba * Revert change to use TargetConstant as it appears to prevent the uimm2 clause from matching in the XTheadBa patterns. * Fix an order of operands bug in the THeadBa pattern visible in the new test coverage. Original commit message follows: This implements a RISCV specific version of the SHL_ADD node proposed in #88791. If that lands, the infrastructure from this patch should seamlessly switch over the to generic DAG node. I'm posting this separately because I've run out of useful multiply strength reduction work to do without having a way to represent MUL X, 3/5/9 as a single instruction. The majority of this change is moving two sets of patterns out of tablgen and into the post-legalize combine. The major reason for this is that I have an upcoming change which needs to reuse the expansion logic, but it also helps common up some code between zba and the THeadBa variants. On the test changes, there's a couple major categories: * We chose a different lowering for mul x, 25. The new lowering involves one fewer register and the same critical path, so this seems like a win. * The order of the two multiplies changes in (3,5,9)*(3,5,9) in some cases. I don't believe this matters. * I'm removing the one use restriction on the multiply. This restriction doesn't really make sense to me, and the test changes appear positive.
Doing so avoids negative interactions with other combines which don't know the shl_add is a single instruction. From the commit log, we've had several combine loops already. This was originally posted as part of #88791, where a bug was pointed out. That bug was fixed by #89789 which hits the same issue from another angle. To confirm the fix, I included the reduced test case here.
Please start with (#89263)