You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[LoopUnroll] Clamp PartialThreshold for large LoopMicroOpBufferSize
The znver3/znver4 scheduler modules are outliers, specifying very
large LoopMicroOpBufferSizes at 512, while typical values for
other subtargets are on the order of ~50. Even if this information
is micro-architecturally correct (*), this does not mean that we
want to runtime unroll all loops to a size that completely fills
the loop buffer. Unless this is the single hot loop in the entire
application, the massive code size increase will bust the micro-op
and instruction caches.
Protect against this by clamping to the default PartialThreshold
of 150, which is the same as the default full-unroll threshold
and half the aggressive full-unroll threshold. Allowing more
partial unrolling than full unrolling is certainly non-sensical.
(*) I strongly doubt that this is actually correct -- I believe
this may derive from an incorrect reading of Agner Fog's
micro-architecture guide. The number 4096 that was originally
used here is the size of the general micro-op cache, not that of
a loop buffer. A separate loop buffer is not listed for the Zen
microarchitecture. Comparing this to the listing for Skylake, it
has a 1536 micro-op buffer, but only a 64 micro-op loopback buffer,
with a note that it's rarely fully utilized. Our scheduling model
specifies LoopMicroOpBufferSize of 50 in that case.
0 commit comments