[AArch64] Suboptimal code for multiplication by certain constants #89430
@llvm/issue-subscribers-backend-aarch64 Author: Karl Meakin (Kmeakin)
For some constants, GCC is able to generate sequences of `add` where LLVM generates `mul`. I have checked all constants between 1 and 100 (https://godbolt.org/z/rxej44fGj):
For all of the examples below (11, 13, 19, 21, 25, 27, 35, 37, 41, 49, 51, 69, 73, 81, 85), LLVM generates:

```asm
mulK:
        mov     w8, K
        mul     w0, w0, w8
        ret
```

while GCC generates sequences of `add` with shifted operands:

```asm
mul11:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 1
        ret
mul13:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 2
        ret
mul19:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 1
        ret
mul21:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 2
        ret
mul25:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 2
        ret
mul27:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 3
        ret
mul35:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 1
        ret
mul37:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 2
        ret
mul41:
        add     w1, w0, w0, lsl 2
        add     w0, w0, w1, lsl 3
        ret
mul49:
        add     w1, w0, w0, lsl 1
        add     w0, w0, w1, lsl 4
        ret
mul51:
        add     w0, w0, w0, lsl 1
        add     w0, w0, w0, lsl 4
        ret
mul69:
        add     w1, w0, w0, lsl 4
        add     w0, w0, w1, lsl 2
        ret
mul73:
        add     w1, w0, w0, lsl 3
        add     w0, w0, w1, lsl 3
        ret
mul81:
        add     w0, w0, w0, lsl 3
        add     w0, w0, w0, lsl 3
        ret
mul85:
        add     w0, w0, w0, lsl 2
        add     w0, w0, w0, lsl 4
        ret
```
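For reference, a minimal C reproducer of the shape presumably behind the godbolt link above (the function names and signatures here are an assumption chosen to match the labels, not the original source):

```c
/* Sketch of the source behind the comparison above (assumed shape).
 * Compile for AArch64 with `gcc -O2` and `clang -O2` and compare the
 * generated code: two shifted adds versus mov + mul. */
int mul11(int x) { return x * 11; }
int mul13(int x) { return x * 13; }
int mul19(int x) { return x * 19; }
int mul85(int x) { return x * 85; }
/* ...and likewise for the other constants listed. */
```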
I believe #88791 and its follow-up patches will fix this :)
We have to be a bit careful weighing these optimizations; for certain combinations of target CPU/shift amount/register width, two add-with-shift instructions are actually more expensive than a multiply.
Also, GCC misses some combinations, for example:
…+shl+add Change the cost model to lower `a = b * C` where `C = (1 + 2^m) * 2^n + 1` to `add w8, w0, w0, lsl #m`; `add w0, w0, w8, lsl #n`. Note: the latency can vary depending on the shift amount. Fix part of llvm#89430
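As a sanity check of that identity, here is a small sketch (the constant C = 11, i.e. m = 2 and n = 1, is chosen purely as an illustration) mirroring the two-`add` sequence in C. Unrolling: `w8 = w0 * (1 + 2^m)`, then `r = w0 + w8 * 2^n = w0 * ((1 + 2^m) * 2^n + 1)`.

```c
#include <assert.h>

int main(void) {
    int m = 2, n = 1;
    /* C = (1 + 2^m) * 2^n + 1 = (1 + 4) * 2 + 1 = 11 */
    int C = ((1 + (1 << m)) << n) + 1;
    assert(C == 11);

    int w0 = 7;               /* arbitrary input */
    int w8 = w0 + (w0 << m);  /* add w8, w0, w0, lsl #m  ->  w8 = 5 * w0  */
    int r  = w0 + (w8 << n);  /* add w0, w0, w8, lsl #n  ->  r  = 11 * w0 */
    assert(r == w0 * C);
    return 0;
}
```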
Do we have any evidence that these are better as add+shift? As far as I understand, GCC optimized it this way because older cores had slower mul and faster add+lsl, but that has changed in more recent cores and mul is now usually relatively quick.
…+shl+add Change the cost model to lower `a = b * C` where `C = (1 + 2^m) * 2^n + 1` to `add w8, w0, w0, lsl #m`; `add w0, w0, w8, lsl #n`. Note: the latency of `add` can vary depending on the shift amount; it is as cheap as a move when the shift amount is 4 or less. Fix part of llvm#89430
This handles all of the numbers listed above.
This case is also not yet supported by LLVM:
…+shl+sub Change the cost model to lower `a = b * C` where `C = 1 - (1 - 2^m) * 2^n` to `sub w8, w0, w0, lsl #m`; `sub w0, w0, w8, lsl #n`. Fix llvm#89430
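Again purely as an illustration (the constant C = 15, i.e. m = 3 and n = 1, is assumed for the example), the two-`sub` sequence can be checked in C; unsigned arithmetic is used so the negative intermediate value wraps the same way the registers do:

```c
#include <assert.h>

int main(void) {
    int m = 3, n = 1;
    /* C = 1 - (1 - 2^m) * 2^n = 1 - (1 - 8) * 2 = 15 */
    int C = 1 - (1 - (1 << m)) * (1 << n);
    assert(C == 15);

    unsigned w0 = 7;               /* arbitrary input */
    unsigned w8 = w0 - (w0 << m);  /* sub w8, w0, w0, lsl #m  ->  w8 = -7 * w0 (mod 2^32) */
    unsigned r  = w0 - (w8 << n);  /* sub w0, w0, w8, lsl #n  ->  r  = 15 * w0 */
    assert(r == w0 * 15u);
    return 0;
}
```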