Skip to content

[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

lukel97
Copy link
Contributor

@lukel97 lukel97 commented Mar 4, 2025

After #124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic.

However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice.

This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.

lukel97 added 2 commits March 4, 2025 12:46
…Splice

After llvm#124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic.

However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice.

This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.
@llvmbot
Copy link
Member

llvmbot commented Mar 4, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

After #124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic.

However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice.

This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.


Full diff: https://github.com/llvm/llvm-project/pull/129645.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+12-8)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll (+2-1)
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index e9f50e88867b2..5ce4d2ae6ff53 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
     return Ctx.TTI.getArithmeticReductionCost(
         Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
   }
+  case VPInstruction::FirstOrderRecurrenceSplice: {
+    assert(VF.isVector());
+    SmallVector<int> Mask(VF.getKnownMinValue());
+    std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
+    Type *VectorTy =
+        toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
+
+    return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
+                                  cast<VectorType>(VectorTy), Mask,
+                                  Ctx.CostKind, VF.getKnownMinValue() - 1);
+  }
   default:
     // TODO: Compute cost other VPInstructions once the legacy cost model has
     // been retired.
@@ -3463,14 +3474,7 @@ VPFirstOrderRecurrencePHIRecipe::computeCost(ElementCount VF,
   if (VF.isScalable() && VF.getKnownMinValue() == 1)
     return InstructionCost::getInvalid();
 
-  SmallVector<int> Mask(VF.getKnownMinValue());
-  std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
-  Type *VectorTy =
-      toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
-
-  return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
-                                cast<VectorType>(VectorTy), Mask, Ctx.CostKind,
-                                VF.getKnownMinValue() - 1);
+  return 0;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
index 0bcfe13832ae7..deeff38b1fe78 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
@@ -51,7 +51,8 @@ define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {
 ; IF-EVL-NEXT:   EMIT vp<[[RESUME_EXTRACT:%.+]]> = extract-from-end ir<[[LD]]>, ir<1>
 ; IF-EVL-NEXT:   EMIT branch-on-cond ir<true>
 ; IF-EVL-NEXT: Successor(s): ir-bb<for.end>, scalar.ph
-
+; IF-EVL: Cost of 0 for VF vscale x 4: FIRST-ORDER-RECURRENCE-PHI ir<[[FOR_PHI]]> = phi ir<33>, ir<[[LD]]>
+; IF-EVL: Cost of 4 for VF vscale x 4: WIDEN-INTRINSIC vp<[[SPLICE]]> = call llvm.experimental.vp.splice(ir<[[FOR_PHI]]>, ir<[[LD]]>, ir<-1>, ir<true>, vp<[[PREV_EVL]]>, vp<[[EVL]]>)
 entry:
   br label %for.body
 

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

@lukel97
Copy link
Contributor Author

lukel97 commented Mar 4, 2025

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getArithmeticReductionCost(
Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
}
case VPInstruction::FirstOrderRecurrenceSplice: {
assert(VF.isVector());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need some assertion message here.

@Mel-Chen
Copy link
Contributor

Mel-Chen commented Mar 5, 2025

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

I didn't get this point. Why would FOR be related to VPWidenIntOrFpInductionRecipe? Could you provide a more details or examples?

@lukel97
Copy link
Contributor Author

lukel97 commented Mar 5, 2025

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

I didn't get this point. Why would FOR be related to VPWidenIntOrFpInductionRecipe? Could you provide a more details or examples?

For this loop where %for1 is i64:

define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {

entry:
  br label %for.body

for.body:
  %indvars = phi i64 [ 0, %entry ], [ %indvars.next, %for.body ]
  %for1 = phi i64 [ 33, %entry ], [ %0, %for.body ]
  %arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
  %0 = load i64, ptr %arrayidx, align 4
  %add = add nsw i64 %for1, %0
  %arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
  store i64 %add, ptr %arrayidx2, align 4
  %indvars.next = add nuw nsw i64 %indvars, 1
  %exitcond.not = icmp eq i64 %indvars.next, %TC
  br i1 %exitcond.not, label %for.end, label %for.body

for.end:
  ret void
}

opt -S -o - -mattr=+v -mtriple=riscv64 -passes=loop-vectorize -force-tail-folding-style=data-with-evl -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

The VPlan coming into VPlanTransforms::tryAddExplicitVectorLength seems to have a VPWidenIntOrFpInductionRecipe, used for predicating?

VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2},UF>=1' {
Live-in vp<%0> = VF
Live-in vp<%1> = VF * UF
Live-in vp<%2> = vector-trip-count
Live-in vp<%3> = backedge-taken count
Live-in ir<%TC> = original trip-count

ir-bb<entry>:
Successor(s): vector.ph

vector.ph:
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    ir<%indvars> = WIDEN-INDUCTION  ir<0>, ir<1>, vp<%0>
    FIRST-ORDER-RECURRENCE-PHI ir<%for1> = phi ir<33>, vp<%6>
    EMIT vp<%5> = icmp ule ir<%indvars>, vp<%3>
    WIDEN-GEP Inv[Var] ir<%arrayidx> = getelementptr inbounds nuw ir<%A>, ir<%indvars>
  Successor(s): pred.load

  <xVFxUF> pred.load: {
    pred.load.entry:
      BRANCH-ON-MASK vp<%5>
    Successor(s): pred.load.if, pred.load.continue

    pred.load.if:
      REPLICATE ir<%0> = load ir<%arrayidx> (S->V)
    Successor(s): pred.load.continue

    pred.load.continue:
      PHI-PREDICATED-INSTRUCTION vp<%6> = ir<%0>
    No successors
  }
  Successor(s): for.body.0

  for.body.0:
    EMIT vp<%7> = first-order splice ir<%for1>, vp<%6>
    WIDEN ir<%add> = add nsw vp<%7>, vp<%6>
    WIDEN-GEP Inv[Var] ir<%arrayidx2> = getelementptr inbounds nuw ir<%B>, ir<%indvars>
  Successor(s): pred.store

  <xVFxUF> pred.store: {
    pred.store.entry:
      BRANCH-ON-MASK vp<%5>
    Successor(s): pred.store.if, pred.store.continue

    pred.store.if:
      REPLICATE store ir<%add>, ir<%arrayidx2>
    Successor(s): pred.store.continue

    pred.store.continue:
    No successors
  }
  Successor(s): for.body.1

  for.body.1:
    EMIT vp<%index.next> = add vp<%4>, vp<%1>
    EMIT branch-on-count vp<%index.next>, vp<%2>
  No successors
}

I'm not sure why we're predicating this to begin with, the VPlan when %for1 is i32 makes more sense to me:

VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {
Live-in vp<%0> = VF * UF
Live-in vp<%1> = vector-trip-count
Live-in vp<%2> = backedge-taken count
Live-in ir<%TC> = original trip-count

ir-bb<entry>:
Successor(s): vector.ph

vector.ph:
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    FIRST-ORDER-RECURRENCE-PHI ir<%for1> = phi ir<33>, ir<%0>
    vp<%4> = SCALAR-STEPS vp<%3>, ir<1>
    EMIT vp<%5> = WIDEN-CANONICAL-INDUCTION vp<%3>
    EMIT vp<%6> = icmp ule vp<%5>, vp<%2>
    CLONE ir<%arrayidx> = getelementptr inbounds nuw ir<%A>, vp<%4>
    vp<%7> = vector-pointer ir<%arrayidx>
    WIDEN ir<%0> = load vp<%7>, vp<%6>
    EMIT vp<%8> = first-order splice ir<%for1>, ir<%0>
    WIDEN ir<%add> = add nsw vp<%8>, ir<%0>
    CLONE ir<%arrayidx2> = getelementptr inbounds nuw ir<%B>, vp<%4>
    vp<%9> = vector-pointer ir<%arrayidx2>
    WIDEN store vp<%9>, ir<%add>, vp<%6>
    EMIT vp<%index.next> = add vp<%3>, vp<%0>
    EMIT branch-on-count vp<%index.next>, vp<%1>
  No successors
}
Successor(s): middle.block

middle.block:
  EMIT vp<%vector.recur.extract> = extract-from-end ir<%0>, ir<1>
  EMIT branch-on-cond ir<true>
Successor(s): ir-bb<for.end>, scalar.ph

scalar.ph:
  EMIT vp<%bc.resume.val> = resume-phi vp<%1>, ir<0>
  EMIT vp<%scalar.recur.init> = resume-phi vp<%vector.recur.extract>, ir<33>
Successor(s): ir-bb<for.body>

ir-bb<for.body>:
  IR   %indvars = phi i64 [ 0, %entry ], [ %indvars.next, %for.body ] (extra operand: vp<%bc.resume.val> from scalar.ph)
  IR   %for1 = phi i32 [ 33, %entry ], [ %0, %for.body ] (extra operand: vp<%scalar.recur.init> from scalar.ph)
  IR   %arrayidx = getelementptr inbounds nuw i32, ptr %A, i64 %indvars
  IR   %0 = load i32, ptr %arrayidx, align 4
  IR   %add = add nsw i32 %for1, %0
  IR   %arrayidx2 = getelementptr inbounds nuw i32, ptr %B, i64 %indvars
  IR   store i32 %add, ptr %arrayidx2, align 4
  IR   %indvars.next = add nuw nsw i64 %indvars, 1
  IR   %exitcond.not = icmp eq i64 %indvars.next, %TC
No successors

ir-bb<for.end>:
No successors
}

@lukel97
Copy link
Contributor Author

lukel97 commented Mar 13, 2025

Ping

Copy link
Contributor

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@Mel-Chen
Copy link
Contributor

%arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
%0 = load i64, ptr %arrayidx, align 4
%add = add nsw i64 %for1, %0
%arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
store i64 %add, ptr %arrayidx2, align 4

I guess the issue might be due to load/store alignment. Try changing it to align 8 and try again.

@lukel97
Copy link
Contributor Author

lukel97 commented Mar 14, 2025

%arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
%0 = load i64, ptr %arrayidx, align 4
%add = add nsw i64 %for1, %0
%arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
store i64 %add, ptr %arrayidx2, align 4

I guess the issue might be due to load/store alignment. Try changing it to align 8 and try again.

Argh that's exactly it, I think I have align attribute blindness. Ignore my previous comments!

@lukel97 lukel97 merged commit 26324bc into llvm:main Mar 14, 2025
8 of 11 checks passed
lukel97 added a commit to lukel97/llvm-project that referenced this pull request Mar 16, 2025
Fixes llvm#131359

After llvm#129645, a first-order recurrence will no longer have it's splice costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and is dead.

The legacy cost model didn't account for this, so update this to avoid the "VPlan cost model and legacy cost model disagreed" assertion.

Alternatively we could also account for this in planContainsAdditionalSimplifications
lukel97 added a commit that referenced this pull request Mar 17, 2025
…1486)

Fixes #131359

After #129645, a first-order recurrence will no longer have it's splice
costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and
is dead.

The legacy cost model didn't account for this, so this accounts for it
in planContainsAdditionalSimplifications to avoid the "VPlan cost model
and legacy cost model disagreed" assertion.
frederik-h pushed a commit to frederik-h/llvm-project that referenced this pull request Mar 18, 2025
…Splice (llvm#129645)

After llvm#124093 we now support fixed-order recurrences with EVL tail
folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP
splice intrinsic.

However the costing for the splice is currently done in
VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic
we end up costing it twice.

This fixes it by splitting out the cost for the splice into
FirstOrderRecurrenceSplice so that it's not duplicated when we replace
it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe
since the splice might end up dead and discarded, e.g. in the test
@pr97452_scalable_vf1_for.
@ro-i
Copy link
Contributor

ro-i commented Apr 13, 2025

This commit (26324bc) breaks building the blender SPEC benchmark (https://www.spec.org/cpu2017/Docs/benchmarks/526.blender_r.html).
The crash information:
crash.txt

The reduced input file:

define float @normal_poly_v3(ptr %n, ptr %verts, ptr %v_prev.016, i64 %wide.trip.count) {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %v_prev.0163 = phi ptr [ %verts, %entry ], [ %v_curr.017, %for.body ]
  %v_curr.017 = getelementptr nusw [3 x float], ptr %verts, i64 %indvars.iv
  %0 = load float, ptr %v_curr.017, align 4
  store float %0, ptr %n, align 4
  %1 = load float, ptr %v_prev.0163, align 4
  store float %1, ptr %v_prev.016, align 4
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv, %wide.trip.count
  br i1 %exitcond.not, label %for.end.loopexit, label %for.body

for.end.loopexit:                                 ; preds = %for.body
  ret float 0.000000e+00
}

@ronlieb
Copy link
Contributor

ronlieb commented Apr 16, 2025

@lukel97 plz see above issue ?

fhahn added a commit that referenced this pull request Apr 16, 2025
Fixed-order recurrence phis cannot be forced to be scalar, they will
always be widened at the moment.

Make sure we don't add them to ForcedScalars, otherwise the legacy cost
model will compute incorrect costs.

This fixes an assertion reported with
#129645.
@fhahn
Copy link
Contributor

fhahn commented Apr 16, 2025

Looks like the issue was that the legacy cost model would treat first-order recurrence phis as forced scalars, but they cannot be scalarized, they will always generate a wide recipe. Pushed 41c1a7b to fix the legacy cost model to not force such phis as scalars.

llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Apr 16, 2025
Fixed-order recurrence phis cannot be forced to be scalar, they will
always be widened at the moment.

Make sure we don't add them to ForcedScalars, otherwise the legacy cost
model will compute incorrect costs.

This fixes an assertion reported with
llvm/llvm-project#129645.
var-const pushed a commit to ldionne/llvm-project that referenced this pull request Apr 17, 2025
Fixed-order recurrence phis cannot be forced to be scalar, they will
always be widened at the moment.

Make sure we don't add them to ForcedScalars, otherwise the legacy cost
model will compute incorrect costs.

This fixes an assertion reported with
llvm#129645.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants