[VPlan] Move FOR splice cost into VPInstruction::FirstOrderRecurrenceSplice #129645
After llvm#124093 we now support fixed-order recurrences with EVL tail folding by replacing VPInstruction::FirstOrderRecurrenceSplice with a VP splice intrinsic. However, the splice is currently costed in VPFirstOrderRecurrencePHIRecipe, so when we add the VP splice intrinsic we end up costing it twice. This fixes it by moving the splice cost into FirstOrderRecurrenceSplice, so that it is not duplicated when we replace it. We still have to keep the VF=1 checks in VPFirstOrderRecurrencePHIRecipe since the splice might end up dead and be discarded, e.g. in the test @pr97452_scalable_vf1_for.
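To make the double counting concrete, here is a toy C++ model in which a plan's cost is the sum of its recipes' costs. It is not LLVM code: RecipeKind, Recipe, planCost and replaceSpliceWithVPSplice are hypothetical names. It assumes the splice VPInstruction contributed no cost of its own before this change, and the cost of 4 is borrowed from the test output purely as an illustrative number.

```cpp
#include <cassert>
#include <vector>

// Toy model only: none of these types exist in LLVM.
enum class RecipeKind { FORPhi, FORSplice, VPSpliceIntrinsic };

struct Recipe {
  RecipeKind Kind;
  int Cost;
};

// The plan cost is the sum of the per-recipe costs.
int planCost(const std::vector<Recipe> &Plan) {
  int Total = 0;
  for (const Recipe &R : Plan)
    Total += R.Cost;
  return Total;
}

// Model of the EVL transform: the splice VPInstruction is replaced by a VP
// splice intrinsic recipe, which carries its own cost.
void replaceSpliceWithVPSplice(std::vector<Recipe> &Plan, int VPSpliceCost) {
  assert(VPSpliceCost >= 0 && "costs are non-negative in this model");
  for (Recipe &R : Plan)
    if (R.Kind == RecipeKind::FORSplice)
      R = Recipe{RecipeKind::VPSpliceIntrinsic, VPSpliceCost};
}
```

In this model, keeping the splice cost on the phi gives a total of 8 after the replacement (phi 4 plus VP splice 4), whereas putting it on the splice means the replacement just swaps one costed recipe for another and the total stays at 4.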
Author: Luke Lau (lukel97)
Full diff: https://github.com/llvm/llvm-project/pull/129645.diff
2 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index e9f50e88867b2..5ce4d2ae6ff53 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getArithmeticReductionCost(
Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
}
+ case VPInstruction::FirstOrderRecurrenceSplice: {
+ assert(VF.isVector());
+ SmallVector<int> Mask(VF.getKnownMinValue());
+ std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
+ Type *VectorTy =
+ toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
+
+ return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
+ cast<VectorType>(VectorTy), Mask,
+ Ctx.CostKind, VF.getKnownMinValue() - 1);
+ }
default:
// TODO: Compute cost other VPInstructions once the legacy cost model has
// been retired.
@@ -3463,14 +3474,7 @@ VPFirstOrderRecurrencePHIRecipe::computeCost(ElementCount VF,
if (VF.isScalable() && VF.getKnownMinValue() == 1)
return InstructionCost::getInvalid();
- SmallVector<int> Mask(VF.getKnownMinValue());
- std::iota(Mask.begin(), Mask.end(), VF.getKnownMinValue() - 1);
- Type *VectorTy =
- toVectorTy(Ctx.Types.inferScalarType(this->getVPSingleValue()), VF);
-
- return Ctx.TTI.getShuffleCost(TargetTransformInfo::SK_Splice,
- cast<VectorType>(VectorTy), Mask, Ctx.CostKind,
- VF.getKnownMinValue() - 1);
+ return 0;
}
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
diff --git a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
index 0bcfe13832ae7..deeff38b1fe78 100644
--- a/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
+++ b/llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics-fixed-order-recurrence.ll
@@ -51,7 +51,8 @@ define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {
; IF-EVL-NEXT: EMIT vp<[[RESUME_EXTRACT:%.+]]> = extract-from-end ir<[[LD]]>, ir<1>
; IF-EVL-NEXT: EMIT branch-on-cond ir<true>
; IF-EVL-NEXT: Successor(s): ir-bb<for.end>, scalar.ph
-
+; IF-EVL: Cost of 0 for VF vscale x 4: FIRST-ORDER-RECURRENCE-PHI ir<[[FOR_PHI]]> = phi ir<33>, ir<[[LD]]>
+; IF-EVL: Cost of 4 for VF vscale x 4: WIDEN-INTRINSIC vp<[[SPLICE]]> = call llvm.experimental.vp.splice(ir<[[FOR_PHI]]>, ir<[[LD]]>, ir<-1>, ir<true>, vp<[[PREV_EVL]]>, vp<[[EVL]]>)
entry:
br label %for.body
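For reference, the mask built with std::iota is {VF-1, VF, ..., 2*VF-2}: the last lane of the previous vector followed by the first VF-1 lanes of the current one. The standalone sketch below models that on std::vector, together with a simplified model of llvm.experimental.vp.splice with a negative immediate, as used by the WIDEN-INTRINSIC recipe in the test. The helpers spliceMask, shuffle and vpSpliceNegImm are hypothetical; the vp.splice model is based on a reading of the LangRef semantics, ignores the mask operand, and omits the undefined lanes past EVL2.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Splice mask for a known-minimum VF: {VF-1, VF, ..., 2*VF-2}.
std::vector<int> spliceMask(unsigned VF) {
  assert(VF > 0 && "VF must be non-zero");
  std::vector<int> Mask(VF);
  std::iota(Mask.begin(), Mask.end(), static_cast<int>(VF) - 1);
  return Mask;
}

// Two-operand shuffle: indices below V1.size() select from V1, the rest
// select from V2 (shufflevector-style indexing).
std::vector<int> shuffle(const std::vector<int> &V1, const std::vector<int> &V2,
                         const std::vector<int> &Mask) {
  assert(V1.size() == V2.size() && "operands must have the same length");
  std::vector<int> Result;
  for (int Idx : Mask) {
    auto I = static_cast<std::size_t>(Idx);
    Result.push_back(I < V1.size() ? V1[I] : V2[I - V1.size()]);
  }
  return Result;
}

// Simplified vp.splice with a negative immediate: concatenate V1[0..EVL1)
// and V2[0..EVL2), then take EVL2 elements starting at index EVL1 + Imm.
std::vector<int> vpSpliceNegImm(const std::vector<int> &V1,
                                const std::vector<int> &V2, int Imm,
                                std::size_t EVL1, std::size_t EVL2) {
  assert(EVL1 <= V1.size() && EVL2 <= V2.size() && "EVL exceeds length");
  assert(Imm < 0 && static_cast<long>(EVL1) + Imm >= 0 &&
         "need -EVL1 <= Imm < 0");
  std::vector<int> Concat(V1.begin(), V1.begin() + EVL1);
  Concat.insert(Concat.end(), V2.begin(), V2.begin() + EVL2);
  auto Start = static_cast<std::size_t>(static_cast<long>(EVL1) + Imm);
  std::vector<int> Result;
  for (std::size_t I = 0; I < EVL2; ++I)
    Result.push_back(Concat[Start + I]);
  return Result;
}
```

With full vectors both forms agree, but if the previous iteration had only 3 active lanes, vpSpliceNegImm(Prev, Cur, -1, 3, 4) picks Prev[2] rather than Prev[3], which is why the EVL variant takes both the previous and the current EVL.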
The move makes sense. Curious whether it would be easy to add a test where this leads to a different VF being picked?
I gave this a try there, but I couldn't really find a way. Even by doubling the vector cost with
@@ -743,6 +743,17 @@ InstructionCost VPInstruction::computeCost(ElementCount VF,
return Ctx.TTI.getArithmeticReductionCost(
Instruction::Or, cast<VectorType>(VecTy), std::nullopt, Ctx.CostKind);
}
case VPInstruction::FirstOrderRecurrenceSplice: {
assert(VF.isVector());
Need some assertion message here.
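The usual LLVM idiom for an assertion message is to `&&` a string literal onto the condition. A minimal standalone sketch, using a hypothetical ToyElementCount in place of llvm::ElementCount and a hypothetical message text:

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::ElementCount.
struct ToyElementCount {
  unsigned KnownMin;
  bool Scalable;
  // Intended to mirror ElementCount::isVector(): more than one lane, or any
  // non-zero scalable count.
  bool isVector() const { return (Scalable && KnownMin != 0) || KnownMin > 1; }
};

unsigned spliceLaneOffset(ToyElementCount VF) {
  // A string literal is always "true", so it does not change the condition,
  // but it is printed when the assertion fires.
  assert(VF.isVector() &&
         "FirstOrderRecurrenceSplice is only costed for vector VFs");
  return VF.KnownMin - 1;
}
```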
I didn't get this point. Why would FOR be related to VPWidenIntOrFpInductionRecipe? Could you provide more details or examples?
For this loop where
define void @first_order_recurrence(ptr noalias %A, ptr noalias %B, i64 %TC) {
entry:
  br label %for.body

for.body:
  %indvars = phi i64 [ 0, %entry ], [ %indvars.next, %for.body ]
  %for1 = phi i64 [ 33, %entry ], [ %0, %for.body ]
  %arrayidx = getelementptr inbounds nuw i64, ptr %A, i64 %indvars
  %0 = load i64, ptr %arrayidx, align 4
  %add = add nsw i64 %for1, %0
  %arrayidx2 = getelementptr inbounds nuw i64, ptr %B, i64 %indvars
  store i64 %add, ptr %arrayidx2, align 4
  %indvars.next = add nuw nsw i64 %indvars, 1
  %exitcond.not = icmp eq i64 %indvars.next, %TC
  br i1 %exitcond.not, label %for.end, label %for.body

for.end:
  ret void
}
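The loop above computes B[i] = for1 + A[i], where for1 carries the previous A[i] (initially 33). As a rough sketch of why a splice is needed when vectorizing it, here is the scalar form next to a fixed-VF vector form with a scalar epilogue. scalarFOR and vectorFOR are hypothetical names, and the sketch does not model EVL tail folding or predication.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar form: B[i] = for1 + A[i]; for1 is the previous A[i], initially 33.
std::vector<std::int64_t> scalarFOR(const std::vector<std::int64_t> &A) {
  std::vector<std::int64_t> B(A.size());
  std::int64_t For1 = 33;
  for (std::size_t I = 0; I < A.size(); ++I) {
    B[I] = For1 + A[I];
    For1 = A[I];
  }
  return B;
}

// Fixed-VF vector form with a scalar epilogue (no tail folding). Prev plays
// the role of the recurrence phi; only its last lane is ever read.
std::vector<std::int64_t> vectorFOR(const std::vector<std::int64_t> &A,
                                    std::size_t VF) {
  assert(VF > 1 && "the splice only makes sense for a vector VF");
  std::vector<std::int64_t> B(A.size());
  std::vector<std::int64_t> Prev(VF, 0);
  Prev[VF - 1] = 33; // Initial value sits in the last lane.
  std::size_t I = 0;
  for (; I + VF <= A.size(); I += VF) {
    std::vector<std::int64_t> Cur(A.begin() + I, A.begin() + I + VF);
    // Splice: last lane of Prev followed by the first VF-1 lanes of Cur.
    for (std::size_t L = 0; L < VF; ++L) {
      std::int64_t Spliced = L == 0 ? Prev[VF - 1] : Cur[L - 1];
      B[I + L] = Spliced + Cur[L];
    }
    Prev = Cur;
  }
  // The scalar epilogue resumes from the last lane of the last vector.
  std::int64_t For1 = Prev[VF - 1];
  for (; I < A.size(); ++I) {
    B[I] = For1 + A[I];
    For1 = A[I];
  }
  return B;
}
```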
The VPlan coming into
I'm not sure why we're predicating this to begin with; the VPlan when %for1 is i32 makes more sense to me:
Ping
LGTM, thanks
I guess the issue might be due to load/store alignment. Try changing it to align 8 and try again.
Argh, that's exactly it; I think I have align-attribute blindness. Ignore my previous comments!
Fixes llvm#131359. After llvm#129645, a first-order recurrence will no longer have its splice costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and is dead. The legacy cost model didn't account for this, so update it to avoid the "VPlan cost model and legacy cost model disagreed" assertion. Alternatively, we could account for this in planContainsAdditionalSimplifications.
…1486) Fixes #131359. After #129645, a first-order recurrence will no longer have its splice costed if the VPInstruction::FirstOrderRecurrenceSplice has no users and is dead. The legacy cost model didn't account for this, so this accounts for it in planContainsAdditionalSimplifications to avoid the "VPlan cost model and legacy cost model disagreed" assertion.
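The disagreement can be pictured with another toy model (hypothetical types and functions, not the actual LoopVectorize code): the legacy-style model always charges the splice, while the VPlan-style model no longer charges a splice that is dead, so the two only line up if the plan is recognized as containing an additional simplification.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy: one entry per first-order recurrence in the loop.
struct ToyFOR {
  int SpliceCost;      // Cost of the splice shuffle for this recurrence.
  bool SpliceHasUsers; // False if the splice is dead and gets discarded.
};

// Legacy-style cost: charges every recurrence's splice unconditionally.
int legacyCost(const std::vector<ToyFOR> &FORs) {
  int Total = 0;
  for (const ToyFOR &F : FORs)
    Total += F.SpliceCost;
  return Total;
}

// VPlan-style cost once the splice is costed on the splice itself: a dead
// splice is removed from the plan, so nothing is charged for it.
int vplanCost(const std::vector<ToyFOR> &FORs) {
  int Total = 0;
  for (const ToyFOR &F : FORs)
    if (F.SpliceHasUsers)
      Total += F.SpliceCost;
  return Total;
}

// Toy analogue of the idea behind planContainsAdditionalSimplifications:
// a cost mismatch is expected, not a bug, when some splice was dropped.
bool hasDroppedSplice(const std::vector<ToyFOR> &FORs) {
  for (const ToyFOR &F : FORs)
    if (!F.SpliceHasUsers)
      return true;
  return false;
}
```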
This commit (26324bc) breaks the build of the blender SPEC benchmark (https://www.spec.org/cpu2017/Docs/benchmarks/526.blender_r.html). The reduced input file:
define float @normal_poly_v3(ptr %n, ptr %verts, ptr %v_prev.016, i64 %wide.trip.count) {
entry:
  br label %for.body

for.body: ; preds = %for.body, %entry
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %v_prev.0163 = phi ptr [ %verts, %entry ], [ %v_curr.017, %for.body ]
  %v_curr.017 = getelementptr nusw [3 x float], ptr %verts, i64 %indvars.iv
  %0 = load float, ptr %v_curr.017, align 4
  store float %0, ptr %n, align 4
  %1 = load float, ptr %v_prev.0163, align 4
  store float %1, ptr %v_prev.016, align 4
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv, %wide.trip.count
  br i1 %exitcond.not, label %for.end.loopexit, label %for.body

for.end.loopexit: ; preds = %for.body
  ret float 0.000000e+00
}
@lukel97, please see the issue above.
Fixed-order recurrence phis cannot be forced to be scalar; at the moment they will always be widened. Make sure we don't add them to ForcedScalars, otherwise the legacy cost model will compute incorrect costs. This fixes an assertion reported with #129645.
Looks like the issue was that the legacy cost model would treat first-order recurrence phis as forced scalars, but they cannot be scalarized; they will always generate a wide recipe. Pushed 41c1a7b to fix the legacy cost model so that it no longer forces such phis to be scalar.
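A toy sketch of the shape of that fix, with hypothetical types and names (this is not the LoopVectorize code, and the AllUsesScalar criterion is an assumed stand-in for whatever makes the legacy model force a phi scalar): when collecting forced scalars, fixed-order recurrence phis are skipped because VPlan always widens them.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy description of a header phi; not an LLVM type.
struct ToyPhi {
  std::string Name;
  bool IsFixedOrderRecurrence;
  bool AllUsesScalar; // Assumed trigger for forcing the phi scalar.
};

// Collect phis the (toy) legacy model would force to be scalar. Fixed-order
// recurrence phis are skipped: VPlan always widens them, so costing them as
// scalars would not match the plan that is actually built.
std::set<std::string> collectForcedScalars(const std::vector<ToyPhi> &Phis) {
  std::set<std::string> Forced;
  for (const ToyPhi &P : Phis) {
    if (P.IsFixedOrderRecurrence)
      continue;
    if (P.AllUsesScalar)
      Forced.insert(P.Name);
  }
  return Forced;
}
```

Here "v_prev.0163" stands for the pointer recurrence in the reduced blender loop, and "other.phi" is a made-up non-recurrence phi used only for contrast.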