[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

lukel97 · 2025-03-04T05:29:23Z

After #124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic.

However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice.

This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.

…Splice After llvm#124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic. However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice. This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it. We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.

llvmbot · 2025-03-04T05:29:55Z

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

After #124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic.

However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice.

This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it.

We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.

Full diff: https://github.com/llvm/llvm-project/pull/129645.diff

2 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+12-8)
(modified) llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll (+2-1)

diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index e9f50e88867b2..5ce4d2ae6ff53 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
     return Ctx.TTI.getArithmeticReductionCost(
         Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
   }
+  case VPInstruction::FirstOrderRecurrenceSplice: {
+    assert(VF.isVector());
+    SmallVector<int> Mask(VF.getKnownMinValue());
+    std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
+    Type *VectorTy =
+        toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
+
+    return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
+                                  cast<VectorType>(VectorTy), Mask,
+                                  Ctx.CostKind, VF.getKnownMinValue() - 1);
+  }
   default:
     // TODO: Compute cost other VPInstructions once the legacy cost model has
     // been retired.
@@ -3463,14 +3474,7 @@ VPFirstOrderRecurrencePHIRecipe::computeCost(ElementCount VF,
   if (VF.isScalable() && VF.getKnownMinValue() == 1)
     return InstructionCost::getInvalid();
 
-  SmallVector<int> Mask(VF.getKnownMinValue());
-  std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
-  Type *VectorTy =
-      toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
-
-  return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
-                                cast<VectorType>(VectorTy), Mask, Ctx.CostKind,
-                                VF.getKnownMinValue() - 1);
+  return 0;
 }
 
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
index 0bcfe13832ae7..deeff38b1fe78 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
@@ -51,7 +51,8 @@ define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {
 ; IF-EVL-NEXT:   EMIT vp<[[RESUME_EXTRACT:%.+]]> = extract-from-end ir<[[LD]]>, ir<1>
 ; IF-EVL-NEXT:   EMIT branch-on-cond ir<true>
 ; IF-EVL-NEXT: Successor(s): ir-bb<for.end>, scalar.ph
-
+; IF-EVL: Cost of 0 for VF vscale x 4: FIRST-ORDER-RECURRENCE-PHI ir<[[FOR_PHI]]> = phi ir<33>, ir<[[LD]]>
+; IF-EVL: Cost of 4 for VF vscale x 4: WIDEN-INTRINSIC vp<[[SPLICE]]> = call llvm.experimental.vp.splice(ir<[[FOR_PHI]]>, ir<[[LD]]>, ir<-1>, ir<true>, vp<[[PREV_EVL]]>, vp<[[EVL]]>)
 entry:
   br label %for.body

fhahn

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

lukel97 · 2025-03-04T08:46:32Z

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

Mel-Chen · 2025-03-05T09:32:54Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
    return Ctx.TTI.getArithmeticReductionCost(
        Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
  }
+  case VPInstruction::FirstOrderRecurrenceSplice: {
+    assert(VF.isVector());


Need some assertion message here.

Mel-Chen · 2025-03-05T10:04:48Z

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

I didn't get this point. Why would FOR be related to VPWidenIntOrFpInductionRecipe? Could you provide a more details or examples?

lukel97 · 2025-03-05T10:24:20Z

The move makes sense. Curious if it would be easy to add a test where this leads to different VF to be picked?

I gave this a try there but I couldn't really find a way. Even by doubling the vector cost with +tune-dlen-factor2/adding predication the vector costs seem to outweigh the scalar. We also can't seem to EVL vectorize a first order recurrence of i64 yet either, which I thought might increase the vector cost, because it ends up creating a VPWidenIntOrFpInductionRecipe, which we can't handle yet without #115274 / #118638

I didn't get this point. Why would FOR be related to VPWidenIntOrFpInductionRecipe? Could you provide a more details or examples?

For this loop where %for1 is i64:

define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {

entry:
  br label %for.body

for.body:
  %indvars = phi i64 [ 0, %entry ], [ %indvars.next, %for.body ]
  %for1 = phi i64 [ 33, %entry ], [ %0, %for.body ]
  %arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
  %0 = load i64, ptr %arrayidx, align 4
  %add = add nsw i64 %for1, %0
  %arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
  store i64 %add, ptr %arrayidx2, align 4
  %indvars.next = add nuw nsw i64 %indvars, 1
  %exitcond.not = icmp eq i64 %indvars.next, %TC
  br i1 %exitcond.not, label %for.end, label %for.body

for.end:
  ret void
}

opt -S -o - -mattr=+v -mtriple=riscv64 -passes=loop-vectorize -force-tail-folding-style=data-with-evl -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

The VPlan coming into VPlanTransforms::tryAddExplicitVectorLength seems to have a VPWidenIntOrFpInductionRecipe, used for predicating?

VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2},UF>=1' {
Live-in vp<%0> = VF
Live-in vp<%1> = VF * UF
Live-in vp<%2> = vector-trip-count
Live-in vp<%3> = backedge-taken count
Live-in ir<%TC> = original trip-count

ir-bb<entry>:
Successor(s): vector.ph

vector.ph:
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%4> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    ir<%indvars> = WIDEN-INDUCTION  ir<0>, ir<1>, vp<%0>
    FIRST-ORDER-RECURRENCE-PHI ir<%for1> = phi ir<33>, vp<%6>
    EMIT vp<%5> = icmp ule ir<%indvars>, vp<%3>
    WIDEN-GEP Inv[Var] ir<%arrayidx> = getelementptr inbounds nuw ir<%A>, ir<%indvars>
  Successor(s): pred.load

  <xVFxUF> pred.load: {
    pred.load.entry:
      BRANCH-ON-MASK vp<%5>
    Successor(s): pred.load.if, pred.load.continue

    pred.load.if:
      REPLICATE ir<%0> = load ir<%arrayidx> (S->V)
    Successor(s): pred.load.continue

    pred.load.continue:
      PHI-PREDICATED-INSTRUCTION vp<%6> = ir<%0>
    No successors
  }
  Successor(s): for.body.0

  for.body.0:
    EMIT vp<%7> = first-order splice ir<%for1>, vp<%6>
    WIDEN ir<%add> = add nsw vp<%7>, vp<%6>
    WIDEN-GEP Inv[Var] ir<%arrayidx2> = getelementptr inbounds nuw ir<%B>, ir<%indvars>
  Successor(s): pred.store

  <xVFxUF> pred.store: {
    pred.store.entry:
      BRANCH-ON-MASK vp<%5>
    Successor(s): pred.store.if, pred.store.continue

    pred.store.if:
      REPLICATE store ir<%add>, ir<%arrayidx2>
    Successor(s): pred.store.continue

    pred.store.continue:
    No successors
  }
  Successor(s): for.body.1

  for.body.1:
    EMIT vp<%index.next> = add vp<%4>, vp<%1>
    EMIT branch-on-count vp<%index.next>, vp<%2>
  No successors
}

I'm not sure why we're predicating this to begin with, the VPlan when %for1 is i32 makes more sense to me:

VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {
Live-in vp<%0> = VF * UF
Live-in vp<%1> = vector-trip-count
Live-in vp<%2> = backedge-taken count
Live-in ir<%TC> = original trip-count

ir-bb<entry>:
Successor(s): vector.ph

vector.ph:
Successor(s): vector loop

<x1> vector loop: {
  vector.body:
    EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
    FIRST-ORDER-RECURRENCE-PHI ir<%for1> = phi ir<33>, ir<%0>
    vp<%4> = SCALAR-STEPS vp<%3>, ir<1>
    EMIT vp<%5> = WIDEN-CANONICAL-INDUCTION vp<%3>
    EMIT vp<%6> = icmp ule vp<%5>, vp<%2>
    CLONE ir<%arrayidx> = getelementptr inbounds nuw ir<%A>, vp<%4>
    vp<%7> = vector-pointer ir<%arrayidx>
    WIDEN ir<%0> = load vp<%7>, vp<%6>
    EMIT vp<%8> = first-order splice ir<%for1>, ir<%0>
    WIDEN ir<%add> = add nsw vp<%8>, ir<%0>
    CLONE ir<%arrayidx2> = getelementptr inbounds nuw ir<%B>, vp<%4>
    vp<%9> = vector-pointer ir<%arrayidx2>
    WIDEN store vp<%9>, ir<%add>, vp<%6>
    EMIT vp<%index.next> = add vp<%3>, vp<%0>
    EMIT branch-on-count vp<%index.next>, vp<%1>
  No successors
}
Successor(s): middle.block

middle.block:
  EMIT vp<%vector.recur.extract> = extract-from-end ir<%0>, ir<1>
  EMIT branch-on-cond ir<true>
Successor(s): ir-bb<for.end>, scalar.ph

scalar.ph:
  EMIT vp<%bc.resume.val> = resume-phi vp<%1>, ir<0>
  EMIT vp<%scalar.recur.init> = resume-phi vp<%vector.recur.extract>, ir<33>
Successor(s): ir-bb<for.body>

ir-bb<for.body>:
  IR   %indvars = phi i64 [ 0, %entry ], [ %indvars.next, %for.body ] (extra operand: vp<%bc.resume.val> from scalar.ph)
  IR   %for1 = phi i32 [ 33, %entry ], [ %0, %for.body ] (extra operand: vp<%scalar.recur.init> from scalar.ph)
  IR   %arrayidx = getelementptr inbounds nuw i32, ptr %A, i64 %indvars
  IR   %0 = load i32, ptr %arrayidx, align 4
  IR   %add = add nsw i32 %for1, %0
  IR   %arrayidx2 = getelementptr inbounds nuw i32, ptr %B, i64 %indvars
  IR   store i32 %add, ptr %arrayidx2, align 4
  IR   %indvars.next = add nuw nsw i64 %indvars, 1
  IR   %exitcond.not = icmp eq i64 %indvars.next, %TC
No successors

ir-bb<for.end>:
No successors
}

…ze/evl-fixed-order-recurrence-cost

lukel97 · 2025-03-13T08:28:01Z

Ping

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

fhahn

LGTM, thanks

Mel-Chen · 2025-03-14T07:09:03Z

%arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
%0 = load i64, ptr %arrayidx, align 4
%add = add nsw i64 %for1, %0
%arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
store i64 %add, ptr %arrayidx2, align 4

I guess the issue might be due to load/store alignment. Try changing it to align 8 and try again.

lukel97 · 2025-03-14T07:32:57Z

%arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
%0 = load i64, ptr %arrayidx, align 4
%add = add nsw i64 %for1, %0
%arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
store i64 %add, ptr %arrayidx2, align 4

I guess the issue might be due to load/store alignment. Try changing it to align 8 and try again.

Argh that's exactly it, I think I have align attribute blindness. Ignore my previous comments!

Fixes llvm#131359 After llvm#129645, a first-order recurrence will no longer have it's splice costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and is dead. The legacy cost model didn't account for this, so update this to avoid the "VPlan cost model and legacy cost model disagreed" assertion. Alternatively we could also account for this in planContainsAdditionalSimplifications

…1486) Fixes #131359 After #129645, a first-order recurrence will no longer have it's splice costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and is dead. The legacy cost model didn't account for this, so this accounts for it in planContainsAdditionalSimplifications to avoid the "VPlan cost model and legacy cost model disagreed" assertion.

…Splice (llvm#129645) After llvm#124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic. However the costing for the splice is currently done in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice. This fixes it by splitting out the cost for the splice into FirstOrderRecurrenceSplice so that it's not duplicated when we replace it. We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and discarded, e.g. in the test @pr97452_scalable_vf1_for.

ro-i · 2025-04-13T14:28:25Z

This commit (26324bc) breaks building the blender SPEC benchmark (https://www.spec.org/cpu2017/Docs/benchmarks/526.blender_r.html).
The crash information:
crash.txt

The reduced input file:

define float @normal_poly_v3(ptr %n, ptr %verts, ptr %v_prev.016, i64 %wide.trip.count) {
entry:
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %v_prev.0163 = phi ptr [ %verts, %entry ], [ %v_curr.017, %for.body ]
  %v_curr.017 = getelementptr nusw [3 x float], ptr %verts, i64 %indvars.iv
  %0 = load float, ptr %v_curr.017, align 4
  store float %0, ptr %n, align 4
  %1 = load float, ptr %v_prev.0163, align 4
  store float %1, ptr %v_prev.016, align 4
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv, %wide.trip.count
  br i1 %exitcond.not, label %for.end.loopexit, label %for.body

for.end.loopexit:                                 ; preds = %for.body
  ret float 0.000000e+00
}

ronlieb · 2025-04-16T11:59:07Z

@lukel97 plz see above issue ?

Fixed-order recurrence phis cannot be forced to be scalar, they will always be widened at the moment. Make sure we don't add them to ForcedScalars, otherwise the legacy cost model will compute incorrect costs. This fixes an assertion reported with #129645.

fhahn · 2025-04-16T20:59:27Z

Looks like the issue was that the legacy cost model would treat first-order recurrence phis as forced scalars, but they cannot be scalarized, they will always generate a wide recipe. Pushed 41c1a7b to fix the legacy cost model to not force such phis as scalars.

Fixed-order recurrence phis cannot be forced to be scalar, they will always be widened at the moment. Make sure we don't add them to ForcedScalars, otherwise the legacy cost model will compute incorrect costs. This fixes an assertion reported with llvm/llvm-project#129645.

Fixed-order recurrence phis cannot be forced to be scalar, they will always be widened at the moment. Make sure we don't add them to ForcedScalars, otherwise the legacy cost model will compute incorrect costs. This fixes an assertion reported with llvm#129645.

lukel97 added 2 commits March 4, 2025 12:46

Precommit test

a3d6cf4

lukel97 requested review from fhahn, Mel-Chen and preames March 4, 2025 05:29

llvmbot added vectorizers llvm:transforms labels Mar 4, 2025

fhahn reviewed Mar 4, 2025

View reviewed changes

Mel-Chen reviewed Mar 5, 2025

View reviewed changes

lukel97 added 2 commits March 5, 2025 18:41

Add assertion message

6e4fd8f

Merge branch 'main' of github.com:llvm/llvm-project into loop-vectori…

508ebe9

…ze/evl-fixed-order-recurrence-cost

alexey-bataev approved these changes Mar 13, 2025

View reviewed changes

fhahn reviewed Mar 13, 2025

View reviewed changes

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp Outdated Show resolved Hide resolved

Remove getVPSingleValue

c5f5681

fhahn approved these changes Mar 13, 2025

View reviewed changes

lukel97 merged commit 26324bc into llvm:main Mar 14, 2025
8 of 11 checks passed

lukel97 mentioned this pull request Mar 16, 2025

[VPlan] Account for dead FOR splice simplification in cost model #131486

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

lukel97 commented Mar 4, 2025

llvmbot commented Mar 4, 2025 •

edited

Loading

fhahn left a comment

lukel97 commented Mar 4, 2025

Mel-Chen Mar 5, 2025

Mel-Chen commented Mar 5, 2025

lukel97 commented Mar 5, 2025 •

edited

Loading

lukel97 commented Mar 13, 2025

fhahn left a comment

Mel-Chen commented Mar 14, 2025

lukel97 commented Mar 14, 2025

ro-i commented Apr 13, 2025

ronlieb commented Apr 16, 2025

fhahn commented Apr 16, 2025

[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645

Conversation

lukel97 commented Mar 4, 2025

llvmbot commented Mar 4, 2025 • edited Loading

fhahn left a comment

Choose a reason for hiding this comment

lukel97 commented Mar 4, 2025

Mel-Chen Mar 5, 2025

Choose a reason for hiding this comment

Mel-Chen commented Mar 5, 2025

lukel97 commented Mar 5, 2025 • edited Loading

lukel97 commented Mar 13, 2025

fhahn left a comment

Choose a reason for hiding this comment

Mel-Chen commented Mar 14, 2025

lukel97 commented Mar 14, 2025

ro-i commented Apr 13, 2025

ronlieb commented Apr 16, 2025

fhahn commented Apr 16, 2025

llvmbot commented Mar 4, 2025 •

edited

Loading

lukel97 commented Mar 5, 2025 •

edited

Loading