[BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
Replace `BPFMIPeepholeTruncElim` by adding an overload of
`TargetLowering::isZExtFree()` that is aware that zero extension is
free for `ISD::LOAD`.
Short description
=================
The `BPFMIPeepholeTruncElim` pass handles two patterns:
Pattern #1:
  %1 = LDB %0, ...          %1 = LDB %0, ...
  %2 = AND_ri %1, 0xff  ->  %2 = MOV_ri %1        <-- (!)
Pattern #2:
  bb.1:                         bb.1:
    %a = LDB %0, ...              %a = LDB %0, ...
    br %bb3                       br %bb3
  bb.2:                         bb.2:
    %b = LDB %0, ...   ->         %b = LDB %0, ...
    br %bb3                       br %bb3
  bb.3:                         bb.3:
    %1 = PHI %a, %b               %1 = PHI %a, %b
    %2 = AND_ri %1, 0xff          %2 = MOV_ri %1      <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SRL instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns can be handled by built-in transformations at the
instruction selection phase if a suitable `isZExtFree()` implementation
is provided. The idea is borrowed from `ARMTargetLowering::isZExtFree`.
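A minimal sketch of such an overload is shown below. This is an
illustration only, not a quote of the final patch: the exact set of
value types treated as free here is an assumption.

  // Sketch only: would live in llvm/lib/Target/BPF/BPFISelLowering.cpp,
  // with a matching `bool isZExtFree(SDValue Val, EVT VT2) const override;`
  // declaration added to BPFTargetLowering in BPFISelLowering.h.
  #include "BPFISelLowering.h"
  using namespace llvm;

  bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
    // Only loads are known to produce zeroed upper bits for free: BPF
    // LDB/LDH/LDW (and their 32-bit forms) already clear the high bits
    // of the destination register.
    if (Val.getOpcode() != ISD::LOAD)
      return false;
    EVT VT1 = Val.getValueType();
    // Assumed check: widening an 8/16/32-bit loaded value to 32 or 64
    // bits requires no extra instruction.
    return (VT1 == MVT::i8 || VT1 == MVT::i16 || VT1 == MVT::i32) &&
           (VT2 == MVT::i32 || VT2 == MVT::i64);
  }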
When evaluated on the BPF kernel selftests and the remove_truncate_*.ll
LLVM test cases, this revision performs slightly better than
BPFMIPeepholeTruncElim; see the "Impact" section below for details.
The commit also adds a few test cases to make sure that the patterns in
question are handled.
Long description
================
Why this works: Pattern #1
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Log for the `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel` command:
...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
t19: i64 = and t16, Constant:i64<255>
t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Optimized type-legalized selection DAG:
  - `t19 = and t16, 255` has been replaced by `t16` (load).
  - Patterns like `(and (load ... i8), 255)` are replaced by `load`
    in `DAGCombiner::BackwardsPropagateMask` called from
    `DAGCombiner::visitAND`.
  - Similarly, patterns like `(srl (shl ..., 56), 56)` are replaced by
    `(and ..., 255)` in `DAGCombiner::visitSRL` (this function is huge;
    look for the `TLI.shouldFoldConstantShiftPairToMask()` call).
    The shift/mask equivalence is spelled out in the small check after
    this list.
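The equivalence relied on above is easy to verify in isolation. The
snippet below is not LLVM code, just a standalone sanity check of the
identity behind the shift-pair-to-mask fold:

  #include <cassert>
  #include <cstdint>

  int main() {
    const uint64_t samples[] = {0, 0x123456789abcdeffULL, ~0ULL};
    for (uint64_t x : samples) {
      // Shifting left and then logically right by 56 keeps exactly the
      // low byte, i.e. it is the same as masking with 0xff.
      assert(((x << 56) >> 56) == (x & 0xff));
    }
    return 0;
  }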
Why this works: Pattern #2
--------------------------
Consider the following example:
define i1 @foo(ptr %p) {
entry:
%a = load i8, ptr %p, align 1
br label %next
next:
%cond = icmp eq i8 %a, 0
ret i1 %cond
}
Consider the log for the `llc -mcpu=v2 -mtriple=bpfel -debug-only=isel`
command. Log for the first basic block:
Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
t0: ch,glue = EntryToken
t3: i64 = Constant<0>
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5
t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %1
t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
  - `%a = load ...` is lowered as `t6 = (zero_extend (load ...))`;
    without the `isZExtFree()` overload added by this commit it would
    instead be lowered as `t6 = (any_extend (load ...))`.
  - The decision to generate `zero_extend` or `any_extend` is
    made in `RegsForValue::getCopyToRegs` called from
    `SelectionDAGBuilder::CopyValueToVirtualRegister`:
    - if `isZExtFree()` for the load returns true, `zero_extend` is used;
    - `any_extend` is used otherwise.
    A paraphrased sketch of this decision is given after this list.
- Optimized lowered selection DAG:
  - `t6 = (zero_extend (load ...))` is replaced by
    `t9 = load ..., zext from i8`.
    This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad()` called from
    `DAGCombiner::visitZERO_EXTEND`.
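For reference, the extend-kind choice described above boils down to
logic roughly like the following. This is a paraphrased, free-standing
sketch (the helper name and signature are invented for illustration),
not a verbatim quote of SelectionDAGBuilder.cpp:

  #include "llvm/CodeGen/ISDOpcodes.h"
  #include "llvm/CodeGen/SelectionDAG.h"
  #include "llvm/CodeGen/TargetLowering.h"
  using namespace llvm;

  // Approximation of the decision made in RegsForValue::getCopyToRegs
  // when a value is copied into a cross-block virtual register: if the
  // target reports that zero extending `Val` to the register type is
  // free, an any_extend is upgraded to a zero_extend, which later lets
  // computeKnownBits() prove that the upper bits are zero.
  static ISD::NodeType chooseExtendKind(const TargetLowering &TLI,
                                        SDValue Val, EVT RegisterVT,
                                        ISD::NodeType PreferredExtendType) {
    if (PreferredExtendType == ISD::ANY_EXTEND &&
        TLI.isZExtFree(Val, RegisterVT))
      return ISD::ZERO_EXTEND;
    return PreferredExtendType;
  }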
Log for the second basic block:
Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t5: i8 = truncate t4
t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
t9: i64 = any_extend t8
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t18: i64 = and t4, Constant:i64<255>
t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t4: i64 = AssertZext t2, ValueType:ch:i8
t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Note:
- Initial selection DAG:
  - `%0` is an input value for this basic block; it corresponds to the
    load instruction (`t9`) from the first basic block.
  - It is accessed within the basic block via
    `t4` (AssertZext (CopyFromReg t0, ...)).
  - The `AssertZext` is generated by `RegsForValue::getCopyFromRegs`
    called from `SelectionDAGBuilder::getCopyFromRegs`; it is generated
    only when `LiveOutInfo` with a known number of leading zeros is
    present for `%0`.
  - Known register bits in `LiveOutInfo` are computed by
    `SelectionDAG::computeKnownBits` called from
    `SelectionDAGISel::ComputeLiveOutVRegInfo`.
  - `computeKnownBits()` generates leading zeros information for
    `(load ..., zext from ...)` but *does not* generate leading zeros
    information for `(load ..., anyext from ...)`.
    This is why the `isZExtFree()` overload added in this commit is
    important (see the sketch after this list).
- Type-legalized selection DAG:
  - `t5 = truncate t4` is replaced by `t18 = and t4, 255`.
- Optimized type-legalized selection DAG:
  - `t18 = and t4, 255` is replaced by `t4`; this is done by
    `DAGCombiner::SimplifyDemandedBits` called from
    `DAGCombiner::visitAND`, which simplifies patterns like
    `(and (assertzext ...))`.
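To make the leading-zeros point concrete, here is a hypothetical helper
(not part of the commit) showing how the number of provable leading
zeros could be queried from a `SelectionDAG`:

  #include "llvm/CodeGen/SelectionDAG.h"
  #include "llvm/Support/KnownBits.h"
  using namespace llvm;

  // Returns how many leading zero bits the DAG can prove for a value.
  // For an i64 `load ..., zext from i8` this is 56, so an AssertZext
  // can be attached to the live-out register; for
  // `load ..., anyext from i8` the upper bits are unknown and the
  // result is 0.
  static unsigned provenLeadingZeros(SelectionDAG &DAG, SDValue V) {
    KnownBits Known = DAG.computeKnownBits(V);
    return Known.countMinLeadingZeros();
  }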
Impact
------
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 the code generated for remove_truncate_7 and
  remove_truncate_8 improved slightly; for other tests it is
  unchanged.
For remove_truncate_7:
  Before this revision              After this revision
  --------------------              -------------------
  r1 <<= 0x20                       r1 <<= 0x20
  r1 >>= 0x20                       r1 >>= 0x20
  if r1 == 0x0 goto +0x2 <LBB0_2>   if r1 == 0x0 goto +0x2 <LBB0_2>
  r1 = *(u32 *)(r2 + 0x0)           r0 = *(u32 *)(r2 + 0x0)
  goto +0x1 <LBB0_3>                goto +0x1 <LBB0_3>
  <LBB0_2>:                         <LBB0_2>:
  r1 = *(u32 *)(r2 + 0x4)           r0 = *(u32 *)(r2 + 0x4)
  <LBB0_3>:                         <LBB0_3>:
  r0 = r1                           exit
  exit
For remove_truncate_8:
  Before this revision              After this revision
  --------------------              -------------------
  r2 = *(u32 *)(r1 + 0x0)           r2 = *(u32 *)(r1 + 0x0)
  r3 = r2                           r3 = r2
  r3 <<= 0x20                       r3 <<= 0x20
  r4 = r3                           r3 s>>= 0x20
  r4 s>>= 0x20
  if r4 s> 0x2 goto +0x5 <LBB0_3>   if r3 s> 0x2 goto +0x4 <LBB0_3>
  r4 = *(u32 *)(r1 + 0x4)           r3 = *(u32 *)(r1 + 0x4)
  r3 >>= 0x20
  if r3 >= r4 goto +0x2 <LBB0_3>    if r2 >= r3 goto +0x2 <LBB0_3>
  r2 += 0x2                         r2 += 0x2
  *(u32 *)(r1 + 0x0) = r2           *(u32 *)(r1 + 0x0) = r2
  <LBB0_3>:                         <LBB0_3>:
  r0 = 0x3                          r0 = 0x3
  exit                              exit
For the kernel BPF selftests the statistics are as follows:
- For -mcpu=v4: 9 out of 655 object files have differences;
  in all cases the total number of instructions marginally decreased
  (-27 instructions).
- For -mcpu=v2: 21 out of 655 object files have differences:
  - For 19 object files the number of instructions decreased
    (-129 instructions in total): some redundant `rX &= 0xffff`
    and register-to-register assignments were removed;
  - For 2 object files the number of instructions increased by 2
    instructions in each file.
Both -mcpu=v2 instruction count increases can be reduced to the same
example:
define void @foo(ptr %p) {
entry:
%a = load i32, ptr %p, align 4
%b = sext i32 %a to i64
%c = icmp ult i64 1, %b
br i1 %c, label %next, label %end
next:
call void inttoptr (i64 62 to ptr)(i32 %a)
br label %end
end:
ret void
}
Note that this example uses the value loaded into `%a` both as sign
extended (`%b`) and as zero extended (`%a` passed as a parameter).
Here is the difference in the final assembly code:
  Before this revision        After this revision
  --------------------        -------------------
  r1 = *(u32 *)(r1 + 0)       r1 = *(u32 *)(r1 + 0)
  r1 <<= 32                   r1 <<= 32
  r1 s>>= 32                  r1 s>>= 32
  if r1 < 2 goto <LBB0_2>     if r1 < 2 goto <LBB0_2>
                              r1 <<= 32
                              r1 >>= 32
  call 62                     call 62
  <LBB0_2>:                   <LBB0_2>:
  exit                        exit
Before this commit `%a` is passed to the call as a sign-extended value;
after this commit `%a` is passed to the call as a zero-extended value.
Both are correct, as the 32-bit sub-register is the same.
The difference comes from `DAGCombiner` operation on the initial DAG:
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = any_extend t5 <--------------------- (1)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
t6: i64 = zero_extend t5 <--------------------- (2)
t8: ch = CopyToReg t0, Register:i64 %0, t6
t9: i64 = sign_extend t5
t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node `t9` is processed before node `t6`, and the `load` instruction
is combined into a load with sign extension:
Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
and 1 other values
This is done by `DAGCombiner.cpp:tryToFoldExtOfLoad` called from
`DAGCombiner::visitSIGN_EXTEND`. Note that `t5` is used by `t6` which
is `any_extend` in (1) and `zero_extend` in (2).
`tryToFoldExtOfLoad()` rewrites such uses of `t5` differently:
- `any_extend` is simply removed;
- `zero_extend` is replaced by `and t30, 0xffffffff`, which is later
  converted to a pair of shifts. This pair of shifts survives until
  the end of translation.
Differential Revision: https://reviews.llvm.org/D157870