Auto-vectorization via masked.load blocks constprop #134513

Closed
scottmcm opened this issue Apr 5, 2025 · 4 comments · Fixed by #135609

Comments


scottmcm commented Apr 5, 2025

I was writing some Rust code and ended up with the following IR. Even though everything is a constant -- the whole function should fold to ret i64 165 -- the masked loads introduced by auto-vectorization at -Ctarget-cpu=x86-64-v3 keep that from happening:

define noundef i64 @test() unnamed_addr #0 {
bb3.preheader:
  %iter = alloca [64 x i8], align 8
  call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %iter)
  %_3.sroa.5.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 16
  store <4 x i64> <i64 23, i64 16, i64 54, i64 3>, ptr %_3.sroa.5.0.iter.sroa_idx, align 8
  %_3.sroa.9.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 48
  store i64 60, ptr %_3.sroa.9.0.iter.sroa_idx, align 8
  %_3.sroa.10.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 56
  store i64 9, ptr %_3.sroa.10.0.iter.sroa_idx, align 8
  %unmaskedload = load <4 x i64>, ptr %_3.sroa.5.0.iter.sroa_idx, align 8, !alias.scope !2
  %0 = getelementptr inbounds nuw i8, ptr %iter, i64 48
  %wide.masked.load.1 = call <4 x i64> @llvm.masked.load.v4i64.p0(ptr nonnull %0, i32 8, <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i64> poison), !alias.scope !2
  %1 = add <4 x i64> %wide.masked.load.1, %unmaskedload
  %2 = shufflevector <4 x i64> %1, <4 x i64> %unmaskedload, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %3 = tail call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %2)
  call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %iter)
  ret i64 %3
}

It looks like trunk can't optimize that to a constant either: https://llvm.godbolt.org/z/z6MKz6cz1

(Trunk at least doesn't need the store-load of the vector constant, but it still doesn't const-prop the stores and the masked.load.)
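
For reference, a minimal Rust reproducer along these lines (a sketch reconstructed from the constants in the IR above; details may differ from the original source), built with -O -Ctarget-cpu=x86-64-v3:

#[no_mangle]
pub fn test() -> u64 {
    // Reconstructed initializer: these six constants match the stores in the
    // IR above and sum to 165, the value the function should fold to.
    [23u64, 16, 54, 3, 60, 9].into_iter().sum()
}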


nikic commented Apr 6, 2025

What's the original IR? Ideally this should get constant folded before it gets vectorized.


scottmcm commented Apr 8, 2025

Here's the full -C no-prepopulate-passes IR, @nikic: issue-101082.ll

i64 @test_eight() folds fine, probably because its 8 elements are a multiple of the vector length, but i64 @test() sums a 6-element array and doesn't fold.
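
For contrast, a hedged sketch of what i64 @test_eight() presumably looks like at the source level (the eight initializer values here are made up for illustration; only the six-element constants appear in the IR above):

#[no_mangle]
pub fn test_eight() -> u64 {
    // Hypothetical values: eight elements fill two whole <4 x i64> vectors,
    // so the vectorizer presumably needs no masked tail load and the sum folds.
    [1u64, 2, 3, 4, 5, 6, 7, 8].into_iter().sum()
}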


nikic commented Apr 8, 2025

What happens here is that the load of the loop counter gets load-only promoted by LICM, but at that point we only have the load in the preheader, not the value forwarded from the initialization to 0. So LoopUnrollFull does not know that this is actually a loop with 6 iterations.

We've recently gained load-only promotion support in SROA, but we currently only use it for readonly calls. I believe we could also use it to SROA the case where we have unknown-offset loads, as long as all the stores are known-offset. That would allow the optimization to occur earlier, including direct forwarding of the initialization value.
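
A hypothetical Rust-level sketch of that shape (not taken from the issue): every store into the local array comes from the constant initializer at a known offset, while the load's offset is only known at run time, which is the case the relaxed SROA load-only promotion would need to handle.

pub fn pick(i: usize) -> u64 {
    // All stores are at known constant offsets (the initializer);
    // the load below is at an offset unknown at compile time.
    let a = [23u64, 16, 54, 3, 60, 9];
    a[i % 6]
}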


nikic commented Apr 8, 2025

It looks like that would basically work, but there's an issue with

AllSameAndValid &= PartitionEnd->beginOffset() == BeginOffset &&
                   PartitionEnd->endOffset() == EndOffset;
blocking the transform if there are lifetime intrinsics. Need to relax that first.

nikic self-assigned this Apr 13, 2025
nikic added a commit to nikic/llvm-project that referenced this issue Apr 14, 2025
If we do load-only promotion, it is okay if we leave some loads
alone. We only need to know all stores that affect a specific
location.

As such, we can handle loads with unknown offset via the "escaped
read-only" code path.

This is something we already support in LICM load-only promotion,
but doing this in SROA is much better from a phase ordering
perspective.

Fixes llvm#134513.
nikic closed this as completed in 5c97397 Apr 17, 2025
nikic marked this as a duplicate of #134735 Apr 17, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this issue Apr 17, 2025
…ds (#135609)

var-const pushed a commit to ldionne/llvm-project that referenced this issue Apr 17, 2025
…5609)
