
[X86] Missed combining of usubo+cmp into sbb #53432

Open

chfast opened this issue Jan 26, 2022 · 16 comments


chfast commented Jan 26, 2022

This is an implementation of less-than for a 2x i64 multi-precision integer, done by checking the carry flag from the subtraction.
https://godbolt.org/z/Kbhnq8c91

define i1 @subcarry_ult_2x64(i64 %x0, i64 %x1, i64 %y0, i64 %y1) nounwind {
  %b0 = icmp ult i64 %x0, %y0
  %d1 = sub i64 %x1, %y1
  %b10 = icmp ult i64 %x1, %y1
  %b0z = zext i1 %b0 to i64
  %b11 = icmp ult i64 %d1, %b0z
  %b1 = or i1 %b10, %b11
  ret i1 %b1
}

Output:

subcarry_ult_2x64:
        sub     rsi, rcx
        setb    cl
        cmp     rdi, rdx
        sbb     rsi, 0
        setb    al
        or      al, cl
        ret

Expected:

subcarry_ult_2x64:
        cmp     rdi, rdx
        sbb     rsi, rcx
        setb    al
        retq
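
For reference, C++ source along these lines (a hypothetical reconstruction from the IR above, not necessarily the original library code) lowers to the IR shown at the top:

#include <cstdint>

// Hypothetical reconstruction: x < y for a 2x64-bit value, computed by
// propagating the borrow of the low-limb comparison into the high limbs.
struct u128 { uint64_t lo, hi; };

bool subcarry_ult_2x64(u128 x, u128 y) {
    bool borrow_lo = x.lo < y.lo;                  // %b0:  borrow out of the low limb
    uint64_t d1 = x.hi - y.hi;                     // %d1:  high-limb difference
    bool borrow_hi = x.hi < y.hi;                  // %b10: borrow from the subtraction itself
    bool borrow_prop = d1 < (uint64_t)borrow_lo;   // %b11: borrow when subtracting the low carry
    return borrow_hi || borrow_prop;               // %b1:  overall borrow == (x < y)
}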
fhahn changed the title from "Missed combining of usubo+cmp into sbb" to "[X86] Missed combining of usubo+cmp into sbb" on Feb 2, 2022

llvmbot commented Feb 2, 2022

@llvm/issue-subscribers-backend-x86

RKSimon self-assigned this May 15, 2022

RKSimon commented May 16, 2022

https://reviews.llvm.org/D125642 improves the codegen to:

subcarry_ult_2x64:
    subq %rcx, %rsi
    setb %cl
    cmpq %rdx, %rdi
    sbbq $0, %rsi
    setb %al
    orb %cl, %al
    retq


RKSimon commented May 17, 2022

@chfast where did this code snippet originally come from? I'm not sure whether it's better to try to recognise that this is an i128 comparison in InstCombine or to just continue with backend combines (the remaining pattern is very specific).

@rotateright I'm not sure if we do much to reconstruct wider arithmetic in IR? Although there does appear to be this: https://reviews.llvm.org/D101232


RKSimon commented May 18, 2022

Alive2: https://alive2.llvm.org/ce/z/7gtizV

The increase in instructions suggests trying to widen the icmp in InstCombine might not be a good idea :(

----------------------------------------
define i1 @src(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
%0:
  %b0 = icmp ult i8 %x0, %y0
  %d1 = sub i8 %x1, %y1
  %b10 = icmp ult i8 %x1, %y1
  %b0z = zext i1 %b0 to i8
  %b11 = icmp ult i8 %d1, %b0z
  %b1 = or i1 %b10, %b11
  ret i1 %b1
}
=>
define i1 @tgt(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
%0:
  %xlo = zext i8 %x0 to i16
  %x1z = zext i8 %x1 to i16
  %ylo = zext i8 %y0 to i16
  %y1z = zext i8 %y1 to i16
  %xhi = shl i16 %x1z, 8
  %yhi = shl i16 %y1z, 8
  %x = or i16 %xhi, %xlo
  %y = or i16 %yhi, %ylo
  %b = icmp ult i16 %x, %y
  ret i1 %b
}
Transformation seems to be correct!


chfast commented May 19, 2022

@chfast where did this code snippet originally come from? I'm not sure whether it's better to try to recognise that this is an i128 comparison in InstCombine or to just continue with backend combines (the remaining pattern is very specific).

I have a library implementing uint256 and uint128. In the case of uint128 < uint128, compilers are not doing great, so I'm falling back to unsigned __int128 if available. I just report some outstanding cases to improve LLVM long term.

I once tried to widen multiplication with mixed results: https://reviews.llvm.org/D56277.

rotateright commented

We're actually missing a reduction of the IR in the example with the subtract, which makes it even less likely that we'd try widening in instcombine. (Since zext(%b0) is either 0 or 1, icmp ult %d1, %b0z can only be true when %d1 is 0, i.e. %x1 == %y1, and %b0 is true.)
https://alive2.llvm.org/ce/z/DJW345

define i1 @src(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
  %b0 = icmp ult i8 %x0, %y0
  %b0z = zext i1 %b0 to i8
  %d1 = sub i8 %x1, %y1
  %b11 = icmp ult i8 %d1, %b0z
  ret i1 %b11
}

define i1 @tgt(i8 %x0, i8 %x1, i8 %y0, i8 %y1) {
  %b0 = icmp ult i8 %x0, %y0
  %d1 = icmp eq i8 %x1, %y1
  %b11 = and i1 %d1, %b0
  ret i1 %b11
}

rotateright commented

I'm not sure if that makes it easier to widen in the backend, but that fold in IR does seem to slightly improve the x86 asm:
https://alive2.llvm.org/ce/z/PiFs3Y

rotateright added a commit that referenced this issue May 22, 2022
This is the specific pattern seen in #53432, but it can be extended
in multiple ways:
1. The 'zext' could be an 'and'
2. The 'sub' could be some other binop with a similar ==0 property (udiv).

There might be some way to generalize using knownbits, but that
would require checking that the 'bool' value is created with
some instruction that can be replaced with new icmp+logic.

https://alive2.llvm.org/ce/z/-KCfpa
RKSimon removed their assignment Jun 23, 2022

chfast commented Aug 4, 2022

I updated the output, which is a bit better now, but still not ideal.


chfast commented Aug 4, 2022

I recompiled the C++ code and it is now different because of the InstCombine addition.

define i1 @subcarry_ult_2x64_2(i64 %0, i64 %1, i64 %2, i64 %3) nounwind {
  %5 = icmp ult i64 %0, %2
  %6 = icmp ult i64 %1, %3
  %7 = icmp eq i64 %1, %3
  %8 = and i1 %5, %7
  %9 = or i1 %6, %8
  ret i1 %9
}

Output:

subcarry_ult_2x64_2:
        cmp     rdi, rdx
        setb    dl
        cmp     rsi, rcx
        setb    cl
        sete    al
        and     al, dl
        or      al, cl
        ret

https://godbolt.org/z/7GdehTMMz
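
Written out in C++ (a hypothetical restatement of the IR above, not the original source), this is just a lexicographic comparison of the (hi, lo) limb pairs:

#include <cstdint>

// Hypothetical restatement: %0 = x0 (low), %1 = x1 (high), %2 = y0 (low), %3 = y1 (high).
bool subcarry_ult_2x64_2(uint64_t x0, uint64_t x1, uint64_t y0, uint64_t y1) {
    return x1 < y1 || (x1 == y1 && x0 < y0);
}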


chfast commented Aug 12, 2022

After #56926 and 926e731 the codegen result remains the same.

I also found a similar example:

src:                                    # @src
        cmp     rdi, rdx
        setb    cl
        test    rsi, rsi
        sete    al
        and     al, cl
        ret
tgt:                                    # @tgt
        cmp     rdi, rdx
        sbb     rsi, 0
        setb    al
        ret

https://godbolt.org/z/9qoM3Tqe1
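
In C++ terms, src is roughly the following (a hypothetical restatement inferred from the assembly; the function and parameter names are illustrative):

#include <cstdint>

// src form: two separate compares; tgt folds this into cmp + sbb + setb.
bool lt_with_zero_high(uint64_t x0, uint64_t x1, uint64_t y0) {
    return x0 < y0 && x1 == 0;
}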

Do you think I should try to add this optimization in the X86 backend only? If yes, any hints on where to start?

rotateright commented

Looking at the debug output for the target case shows that we transform to target-independent nodes before producing the x86-specific ones:

      t29: i64,i8 = usubo t2, t6
    t30: i8 = setcccarry t4, Constant:i64<0>, t29:1, setult:ch

So there's a possible generic (DAGCombiner) solution (guarded by appropriate legality checks).

We need to pattern match something like this:

        t11: i1 = setcc t4, Constant:i64<0>, seteq:ch
        t8: i1 = setcc t2, t6, setult:ch
      t12: i1 = and t11, t8

...into the above sequence that uses usubo + setcccarry.
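
A rough sketch of what such a generic combine could look like (hypothetical code, not part of LLVM; the function name, structure, and legality checks are illustrative only):

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Hypothetical sketch: fold
//   and (setcc A, 0, seteq), (setcc X, Y, setult)
// into
//   setcccarry A, 0, (usubo X, Y):1, setult
// This is valid because "A - 0 - borrow" borrows exactly when A == 0 and the
// incoming borrow is set (A <u 0 is impossible).
static SDValue foldAndOfSetCCsToSetCCCarry(SDNode *And, SelectionDAG &DAG,
                                           const TargetLowering &TLI) {
  SDValue EqCC = And->getOperand(0);   // expected: setcc A, 0, seteq
  SDValue UltCC = And->getOperand(1);  // expected: setcc X, Y, setult
  // (A real combine would also try the commuted operand order.)
  if (EqCC.getOpcode() != ISD::SETCC || UltCC.getOpcode() != ISD::SETCC)
    return SDValue();
  if (cast<CondCodeSDNode>(EqCC.getOperand(2))->get() != ISD::SETEQ ||
      !isNullConstant(EqCC.getOperand(1)) ||
      cast<CondCodeSDNode>(UltCC.getOperand(2))->get() != ISD::SETULT)
    return SDValue();

  EVT VT = UltCC.getOperand(0).getValueType();
  if (!TLI.isOperationLegalOrCustom(ISD::USUBO, VT) ||
      !TLI.isOperationLegalOrCustom(ISD::SETCCCARRY, VT))
    return SDValue();

  SDLoc DL(And);
  EVT CarryVT = TLI.getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), VT);
  // usubo X, Y -> {difference, borrow}
  SDValue Sub = DAG.getNode(ISD::USUBO, DL, DAG.getVTList(VT, CarryVT),
                            UltCC.getOperand(0), UltCC.getOperand(1));
  // setcccarry A, 0, borrow, ult
  return DAG.getNode(ISD::SETCCCARRY, DL, And->getValueType(0),
                     EqCC.getOperand(0), EqCC.getOperand(1),
                     Sub.getValue(1), UltCC.getOperand(2));
}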


RKSimon commented Aug 12, 2022

Wasn't something similar proposed in https://reviews.llvm.org/D118037?


chfast commented Aug 23, 2022

I added two test cases in https://reviews.llvm.org/D132463. C source: https://godbolt.org/z/W1qqvqGbr.

Wasn't something similar proposed in https://reviews.llvm.org/D118037?

This one seems to be obsoleted by https://reviews.llvm.org/D127115 (topological order) + https://reviews.llvm.org/D57317.

This one is different in that we want to combine two setcc nodes and there is no sub nor usubo to start with (they are eliminated by InstCombine).


chfast commented Sep 30, 2022

Looking at the debug output for the target case shows that we transform to target-independent nodes before producing the x86-specific ones:

      t29: i64,i8 = usubo t2, t6
    t30: i8 = setcccarry t4, Constant:i64<0>, t29:1, setult:ch

So there's a possible generic (DAGCombiner) solution (guarded by appropriate legality checks).

We need to pattern match something like this:

        t11: i1 = setcc t4, Constant:i64<0>, seteq:ch
        t8: i1 = setcc t2, t6, setult:ch
      t12: i1 = and t11, t8

...into the above sequence that uses usubo + setcccarry.

This looks like a very specific pattern to match. Do you think there is a smaller subpattern that would, e.g., produce only usubo?

Also, is there a way to dump the DAG in this text form without using a debugger?

rotateright commented

I haven't found a smaller target-independent fold for something like this, but it's possible that we could generalize it. Usually, there's some inverse logic pattern that should also be handled (DeMorgan), so that would end with an 'or' instruction and inverted setcc predicates.

The only ways to dump the DAG that I know of are with the "-debug" option or:
--view-dag-combine1-dags
--view-dag-combine2-dags
(there might be more like that; you can use "llc --help-list-hidden" to show all of the options)


chfast commented Sep 30, 2022

There are 2 more patterns already in the tests (transcribed by me manually from view-dag).

The one currently produced by InstCombine.
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/subcarry.ll#L681-L699

t4: i1 = setcc t0, t2, setult:ch
t5: i1 = setcc t1, t3, setult:ch
t6: i1 = setcc t1, t3, seteq:ch
t7: i1 = and t4, t6
t8: i1 = or t7, t5

The older one, from before some changes to InstCombine:
https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/X86/subcarry.ll#L657-L674

t4: i1 = setcc t0, t2, setult:ch
t5: i64,i1 = usubo t1, t3
t6: i1 = setcc t5:0, zext(t4), setult:ch
t8: i1 = or t5:1, t6

Both should be converted to:

t4: i64,i1 = usubo t0, t2
t5: i1 = setcccarry t1, t3, t4:1, setult:ch
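
For intuition: the setcccarry form computes whether t1 - t3 - borrow underflows, which is true exactly when t1 < t3, or t1 == t3 and the borrow is set, which matches both patterns above. A hypothetical brute-force check over 8-bit limbs (not part of the original issue):

#include <cassert>

int main() {
    // Check: (x1 < y1) || (x1 == y1 && b)  ==  "x1 - y1 - b borrows"
    // for all 8-bit x1, y1 and borrow bit b.
    for (unsigned x1 = 0; x1 < 256; ++x1)
        for (unsigned y1 = 0; y1 < 256; ++y1)
            for (unsigned b = 0; b < 2; ++b) {
                bool pattern = (x1 < y1) || (x1 == y1 && b != 0);
                bool borrows = x1 < y1 + b;   // no wrap: y1 + b <= 256
                assert(pattern == borrows);
            }
    return 0;
}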
