
[X86] Missed combining of usubo+cmp into sbb #53432

Open

chfast opened this issue Jan 26, 2022 · 16 comments


chfast commented Jan 26, 2022

This is an implementation of less-than for a 2x i64 multi-precision integer, done by checking the carry flag from the subtraction.
https://godbolt.org/z/Kbhnq8c91

define i1 @subcarry_ult_2x64(i64 %x0, i64 %x1, i64 %y0, i64 %y1) nounwind {
  %b0 = icmp ult i64 %x0, %y0
  %d1 = sub i64 %x1, %y1
  %b10 = icmp ult i64 %x1, %y1
  %b0z = zext i1 %b0 to i64
  %b11 = icmp ult i64 %d1, %b0z
  %b1 = or i1 %b10, %b11
  ret i1 %b1
}

Output:

subcarry_ult_2x64:
        sub     rsi, rcx
        setb    cl
        cmp     rdi, rdx
        sbb     rsi, 0
        setb    al
        or      al, cl
        ret

Expected:

subcarry_ult_2x64:
        cmp     rdi, rdx
        sbb     rsi, rcx
        setb    al
        retq
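
For reference, C++ source along these lines (a hypothetical reconstruction from the IR above, not necessarily the original library code) lowers to the IR shown at the top:

#include <cstdint>

// Hypothetical reconstruction: x < y for a 2x64-bit value, computed by
// propagating the borrow of the low-limb comparison into the high limbs.
struct u128 { uint64_t lo, hi; };

bool subcarry_ult_2x64(u128 x, u128 y) {
    bool borrow_lo = x.lo < y.lo;                  // %b0:  borrow out of the low limb
    uint64_t d1 = x.hi - y.hi;                     // %d1:  high-limb difference
    bool borrow_hi = x.hi < y.hi;                  // %b10: borrow from the subtraction itself
    bool borrow_prop = d1 < (uint64_t)borrow_lo;   // %b11: borrow when subtracting the low carry
    return borrow_hi || borrow_prop;               // %b1:  overall borrow == (x < y)
}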
fhahn changed the title from "Missed combining of usubo+cmp into sbb" to "[X86] Missed combining of usubo+cmp into sbb" on Feb 2, 2022

llvmbot commented Feb 2, 2022

@llvm/issue-subscribers-backend-x86

RKSimon self-assigned this May 15, 2022

RKSimon commented May 16, 2022

https://reviews.llvm.org/D125642 improves the codegen to:

subcarry_ult_2x64:
    subq %rcx, %rsi
    setb %cl
    cmpq %rdx, %rdi
    sbbq $0, %rsi
    setb %al
    orb %cl, %al
    retq


RKSimon commented May 17, 2022

@chfast where did this code snippet originally come from? I'm not sure whether it's better to try to recognise that this is an i128 comparison in InstCombine or to just continue with backend combines (the remaining pattern is very specific).

@rotateright I'm not sure if we do much to reconstruct wider arithmetic in IR? Although there does appear to be this: https://reviews.llvm.org/D101232


RKSimon commented May 18, 2022

Alive2: https://alive2.llvm.org/ce/z/7gtizV

The increase in instructions suggests trying to widen the icmp in InstCombine might not be a good idea :(

----------------------------------------
define i1 @src(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
%0:
  %b0 = icmp ult i8 %x0, %y0
  %d1 = sub i8 %x1, %y1
  %b10 = icmp ult i8 %x1, %y1
  %b0z = zext i1 %b0 to i8
  %b11 = icmp ult i8 %d1, %b0z
  %b1 = or i1 %b10, %b11
  ret i1 %b1
}
=>
define i1 @tgt(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
%0:
  %xlo = zext i8 %x0 to i16
  %x1z = zext i8 %x1 to i16
  %ylo = zext i8 %y0 to i16
  %y1z = zext i8 %y1 to i16
  %xhi = shl i16 %x1z, 8
  %yhi = shl i16 %y1z, 8
  %x = or i16 %xhi, %xlo
  %y = or i16 %yhi, %ylo
  %b = icmp ult i16 %x, %y
  ret i1 %b
}
Transformation seems to be correct!


chfast commented May 19, 2022

@chfast where did this code snippet originally come from? I'm not sure whether it's better to try to recognise that this is an i128 comparison in InstCombine or to just continue with backend combines (the remaining pattern is very specific).

I have a library implementing uint256 and uint128. In the case of uint128 < uint128, compilers are not doing great, so I'm falling back to unsigned __int128 if available. I just report some outstanding cases to improve LLVM long term.

I once tried to widen multiplication with mixed results: https://reviews.llvm.org/D56277.

rotateright commented

We're actually missing a reduction of the IR in the example with the subtract, which makes it even less likely that we'd try widening in instcombine. (Since zext(%b0) is either 0 or 1, icmp ult %d1, %b0z can only be true when %d1 is 0, i.e. %x1 == %y1, and %b0 is true.)
https://alive2.llvm.org/ce/z/DJW345

define i1 @src(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
  %b0 = icmp ult i8 %x0, %y0
  %b0z = zext i1 %b0 to i8
  %d1 = sub i8 %x1, %y1
  %b11 = icmp ult i8 %d1, %b0z
  ret i1 %b11
}

define i1 @tgt(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
  %b0 = icmp ult i8 %x0, %y0
  %d1 = icmp eq i8 %x1, %y1
  %b11 = and i1 %d1, %b0
  ret i1 %b11
}

rotateright commented

I'm not sure if that makes it easier to widen in the backend, but that fold in IR does seem to slightly improve the x86 asm:
https://alive2.llvm.org/ce/z/PiFs3Y

rotateright added a commit that referenced this issue May 22, 2022
This is the specific pattern seen in #53432, but it can be extended
in multiple ways:
1. The 'zext' could be an 'and'
2. The 'sub' could be some other binop with a similar ==0 property (udiv).

There might be some way to generalize using knownbits, but that
would require checking that the 'bool' value is created with
some instruction that can be replaced with new icmp+logic.

https://alive2.llvm.org/ce/z/-KCfpa
RKSimon removed their assignment Jun 23, 2022

chfast commented Aug 4, 2022

I updated the output, which is a bit better now, but still not ideal.


chfast commented Aug 4, 2022

I recompiled the C++ code and it is now different because of the InstCombine addition.

define i1 @subcarry_ult_2x64_2(i64 %0, i64 %1, i64 %2, i64 %3) nounwind {
  %5 = icmp ult i64 %0, %2
  %6 = icmp ult i64 %1, %3
  %7 = icmp eq i64 %1, %3
  %8 = and i1 %5, %7
  %9 = or i1 %6, %8
  ret i1 %9
}

Output:

subcarry_ult_2x64_2:
        cmp     rdi, rdx
        setb    dl
        cmp     rsi, rcx
        setb    cl
        sete    al
        and     al, dl
        or      al, cl
        ret

https://godbolt.org/z/7GdehTMMz
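
Written out in C++ (a hypothetical restatement of the IR above, not the original source), this is just a lexicographic comparison of the (hi, lo) limb pairs:

#include <cstdint>

// Hypothetical restatement: %0 = x0 (low), %1 = x1 (high), %2 = y0 (low), %3 = y1 (high).
bool subcarry_ult_2x64_2(uint64_t x0, uint64_t x1, uint64_t y0, uint64_t y1) {
    return x1 < y1 || (x1 == y1 && x0 < y0);
}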


chfast commented Aug 12, 2022

After #56926 and 926e731 the codegen result remains the same.

I also found a similar example:

src:                                    # @src
        cmp     rdi, rdx
        setb    cl
        test    rsi, rsi
        sete    al
        and     al, cl
        ret
tgt:                                    # @tgt
        cmp     rdi, rdx
        sbb     rsi, 0
        setb    al
        ret

https://godbolt.org/z/9qoM3Tqe1
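
In C++ terms, src is roughly the following (a hypothetical restatement inferred from the assembly; the function and parameter names are illustrative):

#include <cstdint>

// src form: two separate compares; tgt folds this into cmp + sbb + setb.
bool lt_with_zero_high(uint64_t x0, uint64_t x1, uint64_t y0) {
    return x0 < y0 && x1 == 0;
}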

Do you think I should try to add this optimization in the X86 backend only? If yes, any hints on where to start?

rotateright commented

Looking at the debug output for the target case shows that we transform to target-independent nodes before producing the x86-specific ones:

      t29: i64,i8 = usubo t2, t6
    t30: i8 = setcccarry t4, Constant:i64<0>, t29:1, setult:ch

So there's a possible generic (DAGCombiner) solution (guarded by appropriate legality checks).

We need to pattern match something like this:

        t11: i1 = setcc t4, Constant:i64<0>, seteq:ch
        t8: i1 = setcc t2, t6, setult:ch
      t12: i1 = and t11, t8

...into the above sequence that uses usubo + setcccarry.
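
A rough sketch of what such a generic combine could look like (hypothetical code, not part of LLVM; the function name, structure, and legality checks are illustrative only):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Hypothetical sketch: fold
//   and (setcc A, 0, seteq), (setcc X, Y, setult)
// into
//   setcccarry A, 0, (usubo X, Y):1, setult
// This is valid because "A - 0 - borrow" borrows exactly when A == 0 and the
// incoming borrow is set (A <u 0 is impossible).
static SDValue foldAndOfSetCCsToSetCCCarry(SDNode *And, SelectionDAG &DAG,
                                           const TargetLowering &TLI) {
  SDValue EqCC = And->getOperand(0);   // expected: setcc A, 0, seteq
  SDValue UltCC = And->getOperand(1);  // expected: setcc X, Y, setult
  // (A real combine would also try the commuted operand order.)
  if (EqCC.getOpcode() != ISD::SETCC || UltCC.getOpcode() != ISD::SETCC)
    return SDValue();
  if (cast<CondCodeSDNode>(EqCC.getOperand(2))->get() != ISD::SETEQ ||
      !isNullConstant(EqCC.getOperand(1)) ||
      cast<CondCodeSDNode>(UltCC.getOperand(2))->get() != ISD::SETULT)
    return SDValue();

  EVT VT = UltCC.getOperand(0).getValueType();
  if (!TLI.isOperationLegalOrCustom(ISD::USUBO, VT) ||
      !TLI.isOperationLegalOrCustom(ISD::SETCCCARRY, VT))
    return SDValue();

  SDLoc DL(And);
  EVT CarryVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
  // usubo X, Y -> {difference, borrow}
  SDValue Sub = DAG.getNode(ISD::USUBO, DL, DAG.getVTList(VT, CarryVT),
                            UltCC.getOperand(0), UltCC.getOperand(1));
  // setcccarry A, 0, borrow, ult
  return DAG.getNode(ISD::SETCCCARRY, DL, And->getValueType(0),
                     EqCC.getOperand(0), EqCC.getOperand(1),
                     Sub.getValue(1), UltCC.getOperand(2));
}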


RKSimon commented Aug 12, 2022

Wasn't something similar proposed in https://reviews.llvm.org/D118037?


chfast commented Aug 23, 2022

I added two test cases in https://reviews.llvm.org/D132463. C source: https://godbolt.org/z/W1qqvqGbr.

Wasn't something similar proposed in https://reviews.llvm.org/D118037?

This one seems to be obsoleted by https://reviews.llvm.org/D127115 (topological order) + https://reviews.llvm.org/D57317.

This one is different in that we want to combine two setcc nodes and there is no sub nor usubo to start with (they are eliminated by InstCombine).


chfast commented Sep 30, 2022

Looking at the debug output for the target case shows that we transform to target-independent nodes before producing the x86-specific ones:

      t29: i64,i8 = usubo t2, t6
    t30: i8 = setcccarry t4, Constant:i64<0>, t29:1, setult:ch

So there's a possible generic (DAGCombiner) solution (guarded by appropriate legality checks).

We need to pattern match something like this:

        t11: i1 = setcc t4, Constant:i64<0>, seteq:ch
        t8: i1 = setcc t2, t6, setult:ch
      t12: i1 = and t11, t8

...into the above sequence that uses usubo + setcccarry.

This looks like a very specific pattern to match. Do you think there is a smaller subpattern that would, e.g., produce only usubo?

Also, is there a way to dump the DAG in this text form without using a debugger?

rotateright commented

I haven't found a smaller target-independent fold for something like this, but it's possible that we could generalize it. Usually, there's some inverse logic pattern that should also be handled (DeMorgan), so that would end with an 'or' instruction and inverted setcc predicates.

The only ways to dump the DAG that I know of are with the "-debug" option or:
--view-dag-combine1-dags
--view-dag-combine2-dags
(there might be more like that; you can use "llc --help-list-hidden" to show all of the options)


chfast commented Sep 30, 2022

There are 2 more patterns already in the tests (transcribed by me manually from view-dag).

The one currently produced by InstCombine.
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/subcarry.ll#L681-L699

t4: i1 = setcc t0, t2, setult:ch
t5: i1 = setcc t1, t3, setult:ch
t6: i1 = setcc t1, t3, seteq:ch
t7: i1 = and t4, t6
t8: i1 = or t7, t5

The older one, from before some changes to InstCombine:
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/subcarry.ll#L657-L674

t4: i1 = setcc t0, t2, setult:ch
t5: i64,i1 = usubo t1, t3
t6: i1 = setcc t5:0, zext(t4), setult:ch
t8: i1 = or t5:1, t6

Both should be converted to:

t4: i64,i1 = usubo t0, t2
t5: i1 = setcccarry t1, t3, t4:1, setult:ch
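
For intuition: the setcccarry form computes whether t1 - t3 - borrow underflows, which is true exactly when t1 < t3, or t1 == t3 and the borrow is set, which matches both patterns above. A hypothetical brute-force check over 8-bit limbs (not part of the original issue):

#include <cassert>

int main() {
    // Check: (x1 < y1) || (x1 == y1 && b)  ==  "x1 - y1 - b borrows"
    // for all 8-bit x1, y1 and borrow bit b.
    for (unsigned x1 = 0; x1 < 256; ++x1)
        for (unsigned y1 = 0; y1 < 256; ++y1)
            for (unsigned b = 0; b < 2; ++b) {
                bool pattern = (x1 < y1) || (x1 == y1 && b != 0);
                bool borrows = x1 < y1 + b;   // no wrap: y1 + b <= 256
                assert(pattern == borrows);
            }
    return 0;
}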
