
[X86] Handle BSF/BSR "zero-input pass through" behaviour #123623


Merged
merged 6 commits into llvm:main on Jan 23, 2025

Conversation

@RKSimon (Collaborator) commented Jan 20, 2025

Intel docs have been updated to be similar to AMD and now describe BSF/BSR as not changing the destination register if the input value was zero. This allows us to support the CTTZ/CTLZ zero-input cases by pre-setting the destination so a zero input produces the NumBits result (BSR is a bit messy as it has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also been confirmed to match this behaviour.

This patch adjusts the X86ISD::BSF/BSR nodes to take a "pass through" argument for zero-input cases; by default this is set to UNDEF to match existing behaviour, but it can be set to a suitable value if supported.

There are still some limits to this: it's only supported for x86_64-capable processors (and I've only enabled it for x86_64 codegen), and Intel CPUs sometimes zero the upper 32 bits of a pass-through register when it is used for BSR32/BSF32 with a zero source value (i.e. the whole 64 bits may not get passed through).

Fixes #122004
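
As an illustration of the change (taken from the bit_ceil.ll test diff below; registers are just what those tests happen to use), the CTLZ lowering on a pass-through-capable CPU drops the CMOV. The fallback constant is 2*NumBits-1 so the final XOR still yields NumBits for a zero input:

# Before: BSR's destination is undefined for a zero input, so a CMOV
# selects the fallback afterwards.
bsrl    %eax, %eax
movl    $63, %ecx
cmovnel %eax, %ecx        # bit index, or 63 if %eax was zero
xorl    $31, %ecx         # ctlz: 31 - index, or 63^31 == 32 for zero

# After: preload the fallback and let BSR pass it through on zero input.
movl    $63, %ecx
bsrl    %eax, %ecx        # %ecx left at 63 if %eax was zero
xorl    $31, %ecx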

github-actions bot commented Jan 20, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 4bcdb26dac4cdadd7f8850a5f9b2e775b73aaf7f cf42716949909450e5c47246425461a3b6921c52 --extensions cpp,h,inc -- llvm/lib/Target/X86/X86ISelLowering.cpp llvm/lib/Target/X86/X86InstrInfo.cpp llvm/lib/Target/X86/X86Subtarget.h llvm/test/TableGen/x86-fold-tables.inc
View the diff from clang-format below.
diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 794aa921ca..310da59bbf 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -5220,15 +5220,17 @@ inline static bool isDefConvertible(const MachineInstr &MI, bool &NoSignFlag,
 }
 
 /// Check whether the use can be converted to remove a comparison against zero.
-/// Returns the EFLAGS condition and the operand that we are comparing against zero.
-static std::pair<X86::CondCode, unsigned> isUseDefConvertible(const MachineInstr &MI) {
+/// Returns the EFLAGS condition and the operand that we are comparing against
+/// zero.
+static std::pair<X86::CondCode, unsigned>
+isUseDefConvertible(const MachineInstr &MI) {
   switch (MI.getOpcode()) {
   default:
     return std::make_pair(X86::COND_INVALID, ~0U);
-  CASE_ND(NEG8r)
-  CASE_ND(NEG16r)
-  CASE_ND(NEG32r)
-  CASE_ND(NEG64r)
+    CASE_ND(NEG8r)
+    CASE_ND(NEG16r)
+    CASE_ND(NEG32r)
+    CASE_ND(NEG64r)
     return std::make_pair(X86::COND_AE, 1U);
   case X86::LZCNT16rr:
   case X86::LZCNT32rr:

@RKSimon force-pushed the x86-bitscan-fallthrough branch 2 times, most recently from fe9b29b to 403573f on January 20, 2025 at 18:57
Intel docs have been updated to be similar to AMD and now describe BSF/BSR as not changing the destination register if the input value was zero. This allows us to support the CTTZ/CTLZ zero-input cases by pre-setting the destination so a zero input produces the NumBits result (BSR is a bit messy as it has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also been confirmed to match this behaviour.

There are still some limits to this: it's only supported for x86_64-capable processors (and I've only enabled it for x86_64 codegen), and there are some Intel CPUs that don't correctly zero the upper 32 bits of a pass-through register when used for BSR32/BSF32 with a zero source value (i.e. the whole 64 bits may get passed through).
@RKSimon force-pushed the x86-bitscan-fallthrough branch from 403573f to b01975b on January 21, 2025 at 10:33
With fall through support, we don't need to load i64 constants anymore
@RKSimon RKSimon marked this pull request as ready for review January 21, 2025 10:43
@llvmbot (Member) commented Jan 21, 2025

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

Intel docs have been updated to be similar to AMD and now describe BSF/BSR as not changing the destination register if the input value was zero. This allows us to support the CTTZ/CTLZ zero-input cases by pre-setting the destination so a zero input produces the NumBits result (BSR is a bit messy as it has to be XOR'd to create a CTLZ result). VIA/Zhaoxin x86_64 CPUs have also been confirmed to match this behaviour.

This patch adjusts the X86ISD::BSR/BSF nodes to take a "pass through" argument for zero-input cases; by default this is set to UNDEF to match existing behaviour, but it can be set to a suitable value if supported.

There are still some limits to this: it's only supported for x86_64-capable processors (and I've only enabled it for x86_64 codegen), and Intel CPUs sometimes zero the upper 32 bits of a pass-through register when it is used for BSR32/BSF32 with a zero source value (i.e. the whole 64 bits may not get passed through).

Fixes #122004


Patch is 56.66 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123623.diff

18 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+51-13)
  • (modified) llvm/lib/Target/X86/X86InstrCompiler.td (+6-6)
  • (modified) llvm/lib/Target/X86/X86InstrFragments.td (+8-5)
  • (modified) llvm/lib/Target/X86/X86InstrInfo.cpp (+13-12)
  • (modified) llvm/lib/Target/X86/X86InstrMisc.td (+25-25)
  • (modified) llvm/lib/Target/X86/X86Subtarget.h (+5)
  • (modified) llvm/test/CodeGen/X86/bit_ceil.ll (+4-8)
  • (modified) llvm/test/CodeGen/X86/combine-or.ll (+2-4)
  • (modified) llvm/test/CodeGen/X86/ctlo.ll (+5-9)
  • (modified) llvm/test/CodeGen/X86/ctlz.ll (+11-20)
  • (modified) llvm/test/CodeGen/X86/cttz.ll (+7-15)
  • (modified) llvm/test/CodeGen/X86/known-never-zero.ll (+74-142)
  • (modified) llvm/test/CodeGen/X86/pr89877.ll (+3-5)
  • (modified) llvm/test/CodeGen/X86/pr90847.ll (+6-10)
  • (modified) llvm/test/CodeGen/X86/pr92569.ll (+4-6)
  • (modified) llvm/test/CodeGen/X86/scheduler-backtracking.ll (+64-76)
  • (modified) llvm/test/TableGen/x86-fold-tables.inc (+6-6)
  • (modified) llvm/test/tools/llvm-mca/X86/BtVer2/clear-super-register-1.s (+3-3)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 87f3f7984989e1..7c7e16abff7fe3 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -436,7 +436,6 @@ X86TargetLowering::X86TargetLowering(const X86TargetMachine &TM,
     setOperationAction(ISD::CTTZ           , MVT::i32  , Custom);
     setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i32  , Legal);
     if (Subtarget.is64Bit()) {
-      setOperationPromotedToType(ISD::CTTZ , MVT::i32, MVT::i64);
       setOperationAction(ISD::CTTZ         , MVT::i64  , Custom);
       setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i64, Legal);
     }
@@ -3386,15 +3385,19 @@ bool X86TargetLowering::shouldFormOverflowOp(unsigned Opcode, EVT VT,
 }
 
 bool X86TargetLowering::isCheapToSpeculateCttz(Type *Ty) const {
-  // Speculate cttz only if we can directly use TZCNT or can promote to i32/i64.
+  // Speculate cttz only if we can directly use TZCNT/CMOV, can promote to
+  // i32/i64 or can rely on BSF passthrough value.
   return Subtarget.hasBMI() || Subtarget.canUseCMOV() ||
+         Subtarget.hasBitScanPassThrough() ||
          (!Ty->isVectorTy() &&
           Ty->getScalarSizeInBits() < (Subtarget.is64Bit() ? 64u : 32u));
 }
 
 bool X86TargetLowering::isCheapToSpeculateCtlz(Type *Ty) const {
-  // Speculate ctlz only if we can directly use LZCNT.
-  return Subtarget.hasLZCNT() || Subtarget.canUseCMOV();
+  // Speculate ctlz only if we can directly use LZCNT/CMOV, or can rely on BSR
+  // passthrough value.
+  return Subtarget.hasLZCNT() || Subtarget.canUseCMOV() ||
+         Subtarget.hasBitScanPassThrough();
 }
 
 bool X86TargetLowering::ShouldShrinkFPConstant(EVT VT) const {
@@ -28694,11 +28697,18 @@ static SDValue LowerCTLZ(SDValue Op, const X86Subtarget &Subtarget,
     Op = DAG.getNode(ISD::ZERO_EXTEND, dl, OpVT, Op);
   }
 
+  // Check if we can safely pass a result through BSR for zero sources.
+  SDValue PassThru = DAG.getUNDEF(OpVT);
+  if (Opc == ISD::CTLZ && Subtarget.hasBitScanPassThrough() &&
+      !DAG.isKnownNeverZero(Op))
+    PassThru = DAG.getConstant(NumBits + NumBits - 1, dl, OpVT);
+
   // Issue a bsr (scan bits in reverse) which also sets EFLAGS.
   SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
-  Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op);
+  Op = DAG.getNode(X86ISD::BSR, dl, VTs, PassThru, Op);
 
-  if (Opc == ISD::CTLZ) {
+  // Skip CMOV if we're using a pass through value.
+  if (Opc == ISD::CTLZ && PassThru.isUndef()) {
     // If src is zero (i.e. bsr sets ZF), returns NumBits.
     SDValue Ops[] = {Op, DAG.getConstant(NumBits + NumBits - 1, dl, OpVT),
                      DAG.getTargetConstant(X86::COND_E, dl, MVT::i8),
@@ -28721,16 +28731,22 @@ static SDValue LowerCTTZ(SDValue Op, const X86Subtarget &Subtarget,
   unsigned NumBits = VT.getScalarSizeInBits();
   SDValue N0 = Op.getOperand(0);
   SDLoc dl(Op);
+  bool NonZeroSrc = DAG.isKnownNeverZero(N0);
 
   assert(!VT.isVector() && Op.getOpcode() == ISD::CTTZ &&
          "Only scalar CTTZ requires custom lowering");
 
+  // Check if we can safely pass a result through BSF for zero sources.
+  SDValue PassThru = DAG.getUNDEF(VT);
+  if (!NonZeroSrc && Subtarget.hasBitScanPassThrough())
+    PassThru = DAG.getConstant(NumBits, dl, VT);
+
   // Issue a bsf (scan bits forward) which also sets EFLAGS.
   SDVTList VTs = DAG.getVTList(VT, MVT::i32);
-  Op = DAG.getNode(X86ISD::BSF, dl, VTs, N0);
+  Op = DAG.getNode(X86ISD::BSF, dl, VTs, PassThru, N0);
 
-  // If src is known never zero we can skip the CMOV.
-  if (DAG.isKnownNeverZero(N0))
+  // Skip CMOV if src is never zero or we're using a pass through value.
+  if (NonZeroSrc || !PassThru.isUndef())
     return Op;
 
   // If src is zero (i.e. bsf sets ZF), returns NumBits.
@@ -38193,12 +38209,34 @@ void X86TargetLowering::computeKnownBitsForTargetNode(const SDValue Op,
     Known = KnownBits::mul(Known, Known2);
     break;
   }
-  case X86ISD::BSR:
-    // BSR(0) is undef, but any use of BSR already accounts for non-zero inputs.
-    // Similar KnownBits behaviour to CTLZ_ZERO_UNDEF.
+  case X86ISD::BSF: {
+    Known.Zero.setBitsFrom(Log2_32(BitWidth));
+
+    KnownBits Known2;
+    Known2 = DAG.computeKnownBits(Op.getOperand(1), DemandedElts, Depth + 1);
+    if (Known2.isNonZero()) {
+      // If we have a known 1, its position is our upper bound.
+      unsigned PossibleTZ = Known2.countMaxTrailingZeros();
+      unsigned LowBits = llvm::bit_width(PossibleTZ);
+      Known.Zero.setBitsFrom(LowBits);
+    } else if (!Op.getOperand(0).isUndef()) {
+      Known2 = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+      Known = Known.intersectWith(Known2);
+    }
+    break;
+  }
+  case X86ISD::BSR: {
     // TODO: Bound with input known bits?
     Known.Zero.setBitsFrom(Log2_32(BitWidth));
+
+    if (!Op.getOperand(0).isUndef() &&
+        !DAG.isKnownNeverZero(Op.getOperand(1), Depth + 1)) {
+      KnownBits Known2;
+      Known2 = DAG.computeKnownBits(Op.getOperand(0), DemandedElts, Depth + 1);
+      Known = Known.intersectWith(Known2);
+    }
     break;
+  }
   case X86ISD::SETCC:
     Known.Zero.setBitsFrom(1);
     break;
@@ -54243,7 +54281,7 @@ static SDValue combineXorSubCTLZ(SDNode *N, const SDLoc &DL, SelectionDAG &DAG,
   }
 
   SDVTList VTs = DAG.getVTList(OpVT, MVT::i32);
-  Op = DAG.getNode(X86ISD::BSR, DL, VTs, Op);
+  Op = DAG.getNode(X86ISD::BSR, DL, VTs, DAG.getUNDEF(OpVT), Op);
   if (VT == MVT::i8)
     Op = DAG.getNode(ISD::TRUNCATE, DL, MVT::i8, Op);
 
diff --git a/llvm/lib/Target/X86/X86InstrCompiler.td b/llvm/lib/Target/X86/X86InstrCompiler.td
index 7d4c5c0e10e492..9bda3fd7d951c9 100644
--- a/llvm/lib/Target/X86/X86InstrCompiler.td
+++ b/llvm/lib/Target/X86/X86InstrCompiler.td
@@ -2213,12 +2213,12 @@ def : Pat<(mul (loadi64 addr:$src1), i64immSExt32:$src2),
           (IMUL64rmi32 addr:$src1, i64immSExt32:$src2)>;
 
 // Bit scan instruction patterns to match explicit zero-undef behavior.
-def : Pat<(cttz_zero_undef GR16:$src), (BSF16rr GR16:$src)>;
-def : Pat<(cttz_zero_undef GR32:$src), (BSF32rr GR32:$src)>;
-def : Pat<(cttz_zero_undef GR64:$src), (BSF64rr GR64:$src)>;
-def : Pat<(cttz_zero_undef (loadi16 addr:$src)), (BSF16rm addr:$src)>;
-def : Pat<(cttz_zero_undef (loadi32 addr:$src)), (BSF32rm addr:$src)>;
-def : Pat<(cttz_zero_undef (loadi64 addr:$src)), (BSF64rm addr:$src)>;
+def : Pat<(cttz_zero_undef GR16:$src), (BSF16rr (i16 (IMPLICIT_DEF)), GR16:$src)>;
+def : Pat<(cttz_zero_undef GR32:$src), (BSF32rr (i32 (IMPLICIT_DEF)), GR32:$src)>;
+def : Pat<(cttz_zero_undef GR64:$src), (BSF64rr (i64 (IMPLICIT_DEF)), GR64:$src)>;
+def : Pat<(cttz_zero_undef (loadi16 addr:$src)), (BSF16rm (i16 (IMPLICIT_DEF)), addr:$src)>;
+def : Pat<(cttz_zero_undef (loadi32 addr:$src)), (BSF32rm (i32 (IMPLICIT_DEF)), addr:$src)>;
+def : Pat<(cttz_zero_undef (loadi64 addr:$src)), (BSF64rm (i64 (IMPLICIT_DEF)), addr:$src)>;
 
 // When HasMOVBE is enabled it is possible to get a non-legalized
 // register-register 16 bit bswap. This maps it to a ROL instruction.
diff --git a/llvm/lib/Target/X86/X86InstrFragments.td b/llvm/lib/Target/X86/X86InstrFragments.td
index ea7af893ce103f..8f038330cd2398 100644
--- a/llvm/lib/Target/X86/X86InstrFragments.td
+++ b/llvm/lib/Target/X86/X86InstrFragments.td
@@ -134,8 +134,8 @@ def SDTX86Cmpccxadd : SDTypeProfile<1, 4, [SDTCisSameAs<0, 2>,
 def X86MFence : SDNode<"X86ISD::MFENCE", SDTNone, [SDNPHasChain]>;
 
 
-def X86bsf     : SDNode<"X86ISD::BSF",      SDTUnaryArithWithFlags>;
-def X86bsr     : SDNode<"X86ISD::BSR",      SDTUnaryArithWithFlags>;
+def X86bsf     : SDNode<"X86ISD::BSF",      SDTBinaryArithWithFlags>;
+def X86bsr     : SDNode<"X86ISD::BSR",      SDTBinaryArithWithFlags>;
 def X86fshl    : SDNode<"X86ISD::FSHL",     SDTIntShiftDOp>;
 def X86fshr    : SDNode<"X86ISD::FSHR",     SDTIntShiftDOp>;
 
@@ -685,8 +685,9 @@ def anyext_sdiv : PatFrag<(ops node:$lhs), (anyext node:$lhs),[{
 // register. Truncate can be lowered to EXTRACT_SUBREG. CopyFromReg may
 // be copying from a truncate. AssertSext/AssertZext/AssertAlign aren't saying
 // anything about the upper 32 bits, they're probably just qualifying a
-// CopyFromReg. FREEZE may be coming from a a truncate. Any other 32-bit
-// operation will zero-extend up to 64 bits.
+// CopyFromReg. FREEZE may be coming from a truncate. BitScan fall through
+// values may not zero the upper bits correctly.
+// Any other 32-bit operation will zero-extend up to 64 bits.
 def def32 : PatLeaf<(i32 GR32:$src), [{
   return N->getOpcode() != ISD::TRUNCATE &&
          N->getOpcode() != TargetOpcode::EXTRACT_SUBREG &&
@@ -694,7 +695,9 @@ def def32 : PatLeaf<(i32 GR32:$src), [{
          N->getOpcode() != ISD::AssertSext &&
          N->getOpcode() != ISD::AssertZext &&
          N->getOpcode() != ISD::AssertAlign &&
-         N->getOpcode() != ISD::FREEZE;
+         N->getOpcode() != ISD::FREEZE &&
+         !((N->getOpcode() == X86ISD::BSF || N->getOpcode() == X86ISD::BSR) &&
+           (!N->getOperand(0).isUndef() && !isa<ConstantSDNode>(N->getOperand(0))));
 }]>;
 
 // Treat an 'or' node is as an 'add' if the or'ed bits are known to be zero.
diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 30a5161bbcc502..562c76a1029a10 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -5221,42 +5221,42 @@ inline static bool isDefConvertible(const MachineInstr &MI, bool &NoSignFlag,
 }
 
 /// Check whether the use can be converted to remove a comparison against zero.
-static X86::CondCode isUseDefConvertible(const MachineInstr &MI) {
+static std::pair<X86::CondCode, unsigned> isUseDefConvertible(const MachineInstr &MI) {
   switch (MI.getOpcode()) {
   default:
-    return X86::COND_INVALID;
+    return std::make_pair(X86::COND_INVALID, ~0U);
   CASE_ND(NEG8r)
   CASE_ND(NEG16r)
   CASE_ND(NEG32r)
   CASE_ND(NEG64r)
-    return X86::COND_AE;
+    return std::make_pair(X86::COND_AE, 1U);
   case X86::LZCNT16rr:
   case X86::LZCNT32rr:
   case X86::LZCNT64rr:
-    return X86::COND_B;
+    return std::make_pair(X86::COND_B, 1U);
   case X86::POPCNT16rr:
   case X86::POPCNT32rr:
   case X86::POPCNT64rr:
-    return X86::COND_E;
+    return std::make_pair(X86::COND_E, 1U);
   case X86::TZCNT16rr:
   case X86::TZCNT32rr:
   case X86::TZCNT64rr:
-    return X86::COND_B;
+    return std::make_pair(X86::COND_B, 1U);
   case X86::BSF16rr:
   case X86::BSF32rr:
   case X86::BSF64rr:
   case X86::BSR16rr:
   case X86::BSR32rr:
   case X86::BSR64rr:
-    return X86::COND_E;
+    return std::make_pair(X86::COND_E, 2U);
   case X86::BLSI32rr:
   case X86::BLSI64rr:
-    return X86::COND_AE;
+    return std::make_pair(X86::COND_AE, 1U);
   case X86::BLSR32rr:
   case X86::BLSR64rr:
   case X86::BLSMSK32rr:
   case X86::BLSMSK64rr:
-    return X86::COND_B;
+    return std::make_pair(X86::COND_B, 1U);
     // TODO: TBM instructions.
   }
 }
@@ -5337,6 +5337,7 @@ bool X86InstrInfo::optimizeCompareInstr(MachineInstr &CmpInstr, Register SrcReg,
   bool ClearsOverflowFlag = false;
   bool ShouldUpdateCC = false;
   bool IsSwapped = false;
+  unsigned OpNo = 0;
   X86::CondCode NewCC = X86::COND_INVALID;
   int64_t ImmDelta = 0;
 
@@ -5392,9 +5393,9 @@ bool X86InstrInfo::optimizeCompareInstr(MachineInstr &CmpInstr, Register SrcReg,
         //      ...                 // EFLAGS not changed
         //      testl %eax, %eax    // <-- can be removed
         if (IsCmpZero) {
-          NewCC = isUseDefConvertible(Inst);
-          if (NewCC != X86::COND_INVALID && Inst.getOperand(1).isReg() &&
-              Inst.getOperand(1).getReg() == SrcReg) {
+          std::tie(NewCC, OpNo) = isUseDefConvertible(Inst);
+          if (NewCC != X86::COND_INVALID && Inst.getOperand(OpNo).isReg() &&
+              Inst.getOperand(OpNo).getReg() == SrcReg) {
             ShouldUpdateCC = true;
             MI = &Inst;
             break;
diff --git a/llvm/lib/Target/X86/X86InstrMisc.td b/llvm/lib/Target/X86/X86InstrMisc.td
index 43c02c4f85844c..290d91bb2ce699 100644
--- a/llvm/lib/Target/X86/X86InstrMisc.td
+++ b/llvm/lib/Target/X86/X86InstrMisc.td
@@ -247,55 +247,55 @@ def BSWAP64r : RI<0xC8, AddRegFrm, (outs GR64:$dst), (ins GR64:$src),
 } // Constraints = "$src = $dst", SchedRW
 
 // Bit scan instructions.
-let Defs = [EFLAGS] in {
-def BSF16rr  : I<0xBC, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src),
+let Defs = [EFLAGS], Constraints = "$fallback = $dst" in {
+def BSF16rr  : I<0xBC, MRMSrcReg, (outs GR16:$dst), (ins GR16:$fallback, GR16:$src),
                  "bsf{w}\t{$src, $dst|$dst, $src}",
-                 [(set GR16:$dst, EFLAGS, (X86bsf GR16:$src))]>,
+                 [(set GR16:$dst, EFLAGS, (X86bsf GR16:$fallback, GR16:$src))]>,
                   TB, OpSize16, Sched<[WriteBSF]>;
-def BSF16rm  : I<0xBC, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src),
+def BSF16rm  : I<0xBC, MRMSrcMem, (outs GR16:$dst), (ins GR16:$fallback, i16mem:$src),
                  "bsf{w}\t{$src, $dst|$dst, $src}",
-                 [(set GR16:$dst, EFLAGS, (X86bsf (loadi16 addr:$src)))]>,
+                 [(set GR16:$dst, EFLAGS, (X86bsf GR16:$fallback, (loadi16 addr:$src)))]>,
                  TB, OpSize16, Sched<[WriteBSFLd]>;
-def BSF32rr  : I<0xBC, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src),
+def BSF32rr  : I<0xBC, MRMSrcReg, (outs GR32:$dst), (ins GR32:$fallback, GR32:$src),
                  "bsf{l}\t{$src, $dst|$dst, $src}",
-                 [(set GR32:$dst, EFLAGS, (X86bsf GR32:$src))]>,
+                 [(set GR32:$dst, EFLAGS, (X86bsf GR32:$fallback, GR32:$src))]>,
                  TB, OpSize32, Sched<[WriteBSF]>;
-def BSF32rm  : I<0xBC, MRMSrcMem, (outs GR32:$dst), (ins i32mem:$src),
+def BSF32rm  : I<0xBC, MRMSrcMem, (outs GR32:$dst), (ins GR32:$fallback, i32mem:$src),
                  "bsf{l}\t{$src, $dst|$dst, $src}",
-                 [(set GR32:$dst, EFLAGS, (X86bsf (loadi32 addr:$src)))]>,
+                 [(set GR32:$dst, EFLAGS, (X86bsf GR32:$fallback, (loadi32 addr:$src)))]>,
                  TB, OpSize32, Sched<[WriteBSFLd]>;
-def BSF64rr  : RI<0xBC, MRMSrcReg, (outs GR64:$dst), (ins GR64:$src),
+def BSF64rr  : RI<0xBC, MRMSrcReg, (outs GR64:$dst), (ins GR64:$fallback, GR64:$src),
                   "bsf{q}\t{$src, $dst|$dst, $src}",
-                  [(set GR64:$dst, EFLAGS, (X86bsf GR64:$src))]>,
+                  [(set GR64:$dst, EFLAGS, (X86bsf GR64:$fallback, GR64:$src))]>,
                   TB, Sched<[WriteBSF]>;
-def BSF64rm  : RI<0xBC, MRMSrcMem, (outs GR64:$dst), (ins i64mem:$src),
+def BSF64rm  : RI<0xBC, MRMSrcMem, (outs GR64:$dst), (ins GR64:$fallback, i64mem:$src),
                   "bsf{q}\t{$src, $dst|$dst, $src}",
-                  [(set GR64:$dst, EFLAGS, (X86bsf (loadi64 addr:$src)))]>,
+                  [(set GR64:$dst, EFLAGS, (X86bsf GR64:$fallback, (loadi64 addr:$src)))]>,
                   TB, Sched<[WriteBSFLd]>;
 
-def BSR16rr  : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src),
+def BSR16rr  : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$fallback, GR16:$src),
                  "bsr{w}\t{$src, $dst|$dst, $src}",
-                 [(set GR16:$dst, EFLAGS, (X86bsr GR16:$src))]>,
+                 [(set GR16:$dst, EFLAGS, (X86bsr GR16:$fallback, GR16:$src))]>,
                  TB, OpSize16, Sched<[WriteBSR]>;
-def BSR16rm  : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src),
+def BSR16rm  : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins GR16:$fallback, i16mem:$src),
                  "bsr{w}\t{$src, $dst|$dst, $src}",
-                 [(set GR16:$dst, EFLAGS, (X86bsr (loadi16 addr:$src)))]>,
+                 [(set GR16:$dst, EFLAGS, (X86bsr GR16:$fallback, (loadi16 addr:$src)))]>,
                  TB, OpSize16, Sched<[WriteBSRLd]>;
-def BSR32rr  : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src),
+def BSR32rr  : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$fallback, GR32:$src),
                  "bsr{l}\t{$src, $dst|$dst, $src}",
-                 [(set GR32:$dst, EFLAGS, (X86bsr GR32:$src))]>,
+                 [(set GR32:$dst, EFLAGS, (X86bsr GR32:$fallback, GR32:$src))]>,
                  TB, OpSize32, Sched<[WriteBSR]>;
-def BSR32rm  : I<0xBD, MRMSrcMem, (outs GR32:$dst), (ins i32mem:$src),
+def BSR32rm  : I<0xBD, MRMSrcMem, (outs GR32:$dst), (ins GR32:$fallback, i32mem:$src),
                  "bsr{l}\t{$src, $dst|$dst, $src}",
-                 [(set GR32:$dst, EFLAGS, (X86bsr (loadi32 addr:$src)))]>,
+                 [(set GR32:$dst, EFLAGS, (X86bsr GR32:$fallback, (loadi32 addr:$src)))]>,
                  TB, OpSize32, Sched<[WriteBSRLd]>;
-def BSR64rr  : RI<0xBD, MRMSrcReg, (outs GR64:$dst), (ins GR64:$src),
+def BSR64rr  : RI<0xBD, MRMSrcReg, (outs GR64:$dst), (ins GR64:$fallback, GR64:$src),
                   "bsr{q}\t{$src, $dst|$dst, $src}",
-                  [(set GR64:$dst, EFLAGS, (X86bsr GR64:$src))]>,
+                  [(set GR64:$dst, EFLAGS, (X86bsr GR64:$fallback, GR64:$src))]>,
                   TB, Sched<[WriteBSR]>;
-def BSR64rm  : RI<0xBD, MRMSrcMem, (outs GR64:$dst), (ins i64mem:$src),
+def BSR64rm  : RI<0xBD, MRMSrcMem, (outs GR64:$dst), (ins GR64:$fallback, i64mem:$src),
                   "bsr{q}\t{$src, $dst|$dst, $src}",
-                  [(set GR64:$dst, EFLAGS, (X86bsr (loadi64 addr:$src)))]>,
+                  [(set GR64:$dst, EFLAGS, (X86bsr GR64:$fallback, (loadi64 addr:$src)))]>,
                   TB, Sched<[WriteBSRLd]>;
 } // Defs = [EFLAGS]
 
diff --git a/llvm/lib/Target/X86/X86Subtarget.h b/llvm/lib/Target/X86/X86Subtarget.h
index e3cb9ee8ce1909..c399989f115d75 100644
--- a/llvm/lib/Target/X86/X86Subtarget.h
+++ b/llvm/lib/Target/X86/X86Subtarget.h
@@ -263,6 +263,11 @@ class X86Subtarget final : public X86GenSubtargetInfo {
     return hasBWI() && useAVX512Regs();
   }
 
+  // Returns true if the destination register of a BSF/BSR instruction is
+  // not touched if the source register is zero.
+  // NOTE: i32->i64 implicit zext isn't guaranteed by BSR/BSF pass through.
+  bool hasBitScanPassThrough() const { return is64Bit(); }
+
   bool isXRaySupported() const override { return is64Bit(); }
 
   /// Use clflush if we have SSE2 or we're on x86-64 (even if we asked for
diff --git a/llvm/test/CodeGen/X86/bit_ceil.ll b/llvm/test/CodeGen/X86/bit_ceil.ll
index 823453087f6180..1f21fcac8341d5 100644
--- a/llvm/test/CodeGen/X86/bit_ceil.ll
+++ b/llvm/test/CodeGen/X86/bit_ceil.ll
@@ -10,9 +10,8 @@ define i32 @bit_ceil_i32(i32 %x) {
 ; NOBMI:       # %bb.0:
 ; NOBMI-NEXT:    # kill: def $edi killed $edi def $rdi
 ; NOBMI-NEXT:    leal -1(%rdi), %eax
-; NOBMI-NEXT:    bsrl %eax, %eax
 ; NOBMI-NEXT:    movl $63, %ecx
-; NOBMI-NEXT:    cmovnel %eax, %ecx
+; NOBMI-NEXT:    bsrl %eax, %ecx
 ; NOBMI-NEXT:    xorl $31, %ecx
 ; NOBMI-NEXT:    negb %cl
 ; NOBMI-NEXT:    movl $1, %edx
@@ -47,9 +46,8 @@ define i32 @bit_ceil_i32(i32 %x) {
 define i32 @bit_ceil_i32_plus1(i32 noundef %x) {
 ; NOBMI-LABEL: bit_ceil_i32_plus1:
 ; NOBMI:       # %bb.0: # %entry
-; NOBMI-NEXT:    bsrl %edi, %eax
 ; NOBMI-NEXT:    movl $63, %ecx
-; NOBMI-NEXT:    cmovnel %eax, %ecx
+; NOBMI-NEXT:    bsrl %edi, %ecx
 ; NOBMI-NEXT:    xorl $31, %ecx
 ; NOBMI-NEXT:    negb %cl
 ; NOBMI-NEXT:    movl $1, %edx
@@ -86,9 +84,8 @@ define i64 @bit_ceil_i64(i64 %x) {
 ; NOBMI-LABEL: bit_ceil_i64:
 ; NOBMI:       # %bb.0:
 ; NOBMI-NEXT:    leaq -1(%rdi), %rax
-; NOBMI-NEXT:    bsrq %rax, %rax
 ; NOBMI-NEXT:    movl $127, %ecx
-; NOBMI-NEXT:    cmovneq %rax, %rcx
+; NOBMI-NEXT:    bsrq %rax, %rcx
 ; NOBMI-NEXT:    xorl $63, %ecx
 ; NOBMI-NEXT:    negb %cl
 ; NOBMI-NEXT:    movl $1, %edx
@@ -122,9 +119,8 @@ define i64 @bit_ceil_i64(i64 %x) {
 define i64 @bit_ceil_i64_plus1(i64 noundef %x) {
 ; NOBMI-LABEL: bit_ceil_i64_plus1:
 ; NOBMI:       # %bb.0: # %entry
-; NOBMI-NEXT:    bsrq %rdi, %rax
 ; NOBMI-NEXT:    movl $127, %ecx
-; NOBMI-NEXT:    cmovneq %rax, %rcx
+; NOBMI-NEXT:    bsrq %rdi, %rcx
 ; NOBMI-NEXT:    xorl $63, %ecx
 ; NOBMI-NEXT:    negb %cl
 ; NOBMI-NEXT:    movl $1, %edx
diff --git a/llvm/test/CodeGen/X86/combine-or.ll b/llvm/test/CodeGen/X86/combine-or.ll
index d9c6d7053be746..08262e4d34b269 100644
--- a/llvm/test/CodeGen/X86/combine-or.ll
+++ b/llvm/test/CodeGen/X86/combine-or.ll
@@ -227,9 +227,8 @@ define i64 @PR89533(<64 x i8> %a0) {
 ; SSE-NEXT:    orl %eax, %edx
 ; SSE-NEXT:    shlq $32, %rdx
 ; SSE-NEXT:    orq %rcx, %rdx
-; SSE-NEXT:    bsfq %rdx, %rcx
 ; SSE-NEXT:    movl $64, %eax
-; SSE-NEXT:    cm...
[truncated]

@KanRobert (Contributor) commented:
BSF/BSF -> BSF/BSR in both title and description?

@RKSimon RKSimon changed the title [X86] Handle BSF/BSF "zero-input pass through" behaviour [X86] Handle BSF/BSR "zero-input pass through" behaviour Jan 21, 2025
Comment on lines +5395 to +5397
std::tie(NewCC, OpNo) = isUseDefConvertible(Inst);
if (NewCC != X86::COND_INVALID && Inst.getOperand(OpNo).isReg() &&
Inst.getOperand(OpNo).getReg() == SrcReg) {
Contributor
Why don't put PassThru in the last operand? We don't need to change it then.

Collaborator Author
This is at the MI level, and the passthrough/fallback operand is tied to the destination reg (like any x86 binop) - it'd be weird to have a commutation, and I expect it could lead to further problems.
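
A minimal sketch of what that tying looks like in the emitted code (register choices illustrative): the fallback must already live in the destination register, so BSF reads and writes the same register, just like any other two-address x86 op:

movl    $32, %eax         # $fallback, tied to $dst
bsfl    %edi, %eax        # reads %eax and %edi; leaves %eax at 32 if %edi == 0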

; SSE-NEXT: movl $64, %eax
; SSE-NEXT: cmovneq %rcx, %rax
; SSE-NEXT: rep bsfq %rdx, %rax
Contributor
We don't need rep here, do we? The same below.

Collaborator Author
We use "REP BSF" so that it can be recognized as TZCNT on BMI capable machines, which is a lot quicker than BSF - it uses this pattern as we no longer have any EFLAGS dependency. In that case we just pay the trivial penalty of the extra MOV.

@@ -15,12 +15,12 @@ bsf %rax, %rcx

# CHECK: Iterations: 100
# CHECK-NEXT: Instructions: 400
-# CHECK-NEXT: Total Cycles: 655
+# CHECK-NEXT: Total Cycles: 663
Contributor
Why is it affected?

Contributor
I think it's because the instruction has one more source operand now.

Collaborator Author
Because llvm-mca is too smart :) It can now tell that the passthrough source has a dependency on the destination.

Contributor
I think it makes sense somehow; the dependency is indeed increased. Fortunately BSF doesn't have a partial dependency issue.
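
To make the extra cycles concrete, a sketch of the dependency being modelled (the mca test above scans %rax into %rcx):

# old: the scan only had to wait for its source
bsf     %rax, %rcx
# new: the tied fallback operand adds a second input to wait for
movl    $63, %ecx         # whatever last defined %rcx...
bsf     %rax, %rcx        # ...now feeds the scan, lengthening the chain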

@phoebewang (Contributor) left a comment:
LGTM.

@@ -5220,42 +5220,42 @@ inline static bool isDefConvertible(const MachineInstr &MI, bool &NoSignFlag,
}

/// Check whether the use can be converted to remove a comparison against zero.
Contributor
Could we have a comment for the returned pair?

@RKSimon RKSimon merged commit 90e9895 into llvm:main Jan 23, 2025
7 of 8 checks passed
@RKSimon RKSimon deleted the x86-bitscan-fallthrough branch January 23, 2025 13:00
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Jan 24, 2025
Followup to llvm#123623 - now that the CMOV has been removed, the throughput has improved, reducing the benefit of vectorization on pre-x86-64-v3 CPUs
RKSimon added a commit that referenced this pull request Jan 26, 2025
Followup to #123623 - now that the CMOV has been removed, the throughput has improved, reducing the benefit of vectorization on pre-x86-64-v3 CPUs
phoebewang added a commit to phoebewang/llvm-project that referenced this pull request Feb 22, 2025
Similar to D14748, we can relax lzcnt intrinsics too, especially with
improved BSR lowering by llvm#123623
phoebewang added a commit that referenced this pull request Feb 22, 2025
Similar to D14748, we can relax lzcnt intrinsics too, especially with
improved BSR lowering by #123623
Successfully merging this pull request may close these issues:

  • Suboptimal codegen for llvm.cttz on x86_64 (no BMI1) (#122004)