Auto-vectorization via masked.load blocks constprop #134513

Closed
scottmcm opened this issue Apr 5, 2025 · 4 comments · Fixed by #135609

Comments


scottmcm commented Apr 5, 2025

I was writing some Rust code and ended up with the following IR. Even though everything is a constant -- the whole function should fold to ret i64 165 -- the masked loads introduced by auto-vectorization at -Ctarget-cpu=x86-64-v3 keep that from happening:

define noundef i64 @test() unnamed_addr #0 {
bb3.preheader:
  %iter = alloca [64 x i8], align 8
  call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %iter)
  %_3.sroa.5.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 16
  store <4 x i64> <i64 23, i64 16, i64 54, i64 3>, ptr %_3.sroa.5.0.iter.sroa_idx, align 8
  %_3.sroa.9.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 48
  store i64 60, ptr %_3.sroa.9.0.iter.sroa_idx, align 8
  %_3.sroa.10.0.iter.sroa_idx = getelementptr inbounds nuw i8, ptr %iter, i64 56
  store i64 9, ptr %_3.sroa.10.0.iter.sroa_idx, align 8
  %unmaskedload = load <4 x i64>, ptr %_3.sroa.5.0.iter.sroa_idx, align 8, !alias.scope !2
  %0 = getelementptr inbounds nuw i8, ptr %iter, i64 48
  %wide.masked.load.1 = call <4 x i64> @llvm.masked.load.v4i64.p0(ptr nonnull %0, i32 8, <4 x i1> <i1 true, i1 true, i1 false, i1 false>, <4 x i64> poison), !alias.scope !2
  %1 = add <4 x i64> %wide.masked.load.1, %unmaskedload
  %2 = shufflevector <4 x i64> %1, <4 x i64> %unmaskedload, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  %3 = tail call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> %2)
  call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %iter)
  ret i64 %3
}

It looks like trunk can't optimize that to a constant either: https://llvm.godbolt.org/z/z6MKz6cz1

(Trunk at least doesn't need the store-load of the vector constant, but it still doesn't const-prop the stores and the masked.load.)
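
For reference, a minimal Rust reproducer along these lines (a sketch reconstructed from the constants in the IR above; details may differ from the original source), built with -O -Ctarget-cpu=x86-64-v3:

#[no_mangle]
pub fn test() -> u64 {
    // Reconstructed initializer: these six constants match the stores in the
    // IR above and sum to 165, the value the function should fold to.
    [23u64, 16, 54, 3, 60, 9].into_iter().sum()
}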


nikic commented Apr 6, 2025

What's the original IR? Ideally this should get constant folded before it gets vectorized.


scottmcm commented Apr 8, 2025

Here's the full -C no-prepopulate-passes IR, @nikic: issue-101082.ll

i64 @test_eight() folds fine, probably because its 8 elements are a multiple of the vector length, but i64 @test() sums a 6-element array and doesn't fold.
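
For contrast, a hedged sketch of what i64 @test_eight() presumably looks like at the source level (the eight initializer values here are made up for illustration; only the six-element constants appear in the IR above):

#[no_mangle]
pub fn test_eight() -> u64 {
    // Hypothetical values: eight elements fill two whole <4 x i64> vectors,
    // so the vectorizer presumably needs no masked tail load and the sum folds.
    [1u64, 2, 3, 4, 5, 6, 7, 8].into_iter().sum()
}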


nikic commented Apr 8, 2025

What happens here is that the load of the loop counter gets load-only promoted by LICM, but at that point we only have the load in the preheader, not the value forwarded from the initialization to 0. So LoopUnrollFull does not know that this is actually a loop with 6 iterations.

We've recently gained load-only promotion support in SROA, but we currently only use it for readonly calls. I believe we could also use it to SROA the case where we have unknown-offset loads, as long as all the stores are known-offset. That would allow the optimization to occur earlier, including direct forwarding of the initialization value.
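
A hypothetical Rust-level sketch of that shape (not taken from the issue): every store into the local array comes from the constant initializer at a known offset, while the load's offset is only known at run time, which is the case the relaxed SROA load-only promotion would need to handle.

pub fn pick(i: usize) -> u64 {
    // All stores are at known constant offsets (the initializer);
    // the load below is at an offset unknown at compile time.
    let a = [23u64, 16, 54, 3, 60, 9];
    a[i % 6]
}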


nikic commented Apr 8, 2025

It looks like that would basically work, but there's an issue with

AllSameAndValid &= PartitionEnd->beginOffset() == BeginOffset &&
                   PartitionEnd->endOffset() == EndOffset;
blocking the transform if there are lifetime intrinsics. Need to relax that first.

nikic self-assigned this Apr 13, 2025
nikic added a commit to nikic/llvm-project that referenced this issue Apr 14, 2025
If we do load-only promotion, it is okay if we leave some loads
alone. We only need to know all stores that affect a specific
location.

As such, we can handle loads with unknown offset via the "escaped
read-only" code path.

This is something we already support in LICM load-only promotion,
but doing this in SROA is much better from a phase ordering
perspective.

Fixes llvm#134513.
nikic closed this as completed in 5c97397 Apr 17, 2025
nikic marked this as a duplicate of #134735 Apr 17, 2025
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this issue Apr 17, 2025
…ds (#135609)

var-const pushed a commit to ldionne/llvm-project that referenced this issue Apr 17, 2025
…5609)
