blake3_impl.c and blake3module.c are adapted from the existing BLAKE2
module. This involves a lot of copy-paste, and hopefully someone who
knows this code better can help me clean them up. (In particular, BLAKE2
relies on clinic codegen to share code between BLAKE2b and BLAKE2s, but
BLAKE3 has no need for that.)
blake3_dispatch.c, which is vendored from upstream, includes runtime CPU
feature detection to choose the appropriate SIMD instruction set for the
current platform (x86 only). In this model, the build should compile in
every instruction set and let the dispatcher pick one at runtime, so here
I unconditionally include the Unix assembly files (*_unix.S) as
`extra_objects` in setup.py. This "works on my box",
but is currently incomplete in several ways:
- It needs some Windows-specific build logic. There are two additional
assembly flavors included for each instruction set, *_windows_gnu.S
and *_windows_msvc.asm. I need to figure out how to include the right
flavor based on the target OS/ABI.
- I need to figure out how to omit these files on non-x86-64 platforms.
x86-32 will require some explicit preprocessor definitions to restrict
blake3_dispatch.c to portable code. (Unless we vendor intrinsics-based
implementations for 32-bit support. More on this below.)
- It's not going to work on compilers that are too old to recognize
these instruction sets, particularly AVX-512. (Question: What's the
oldest GCC version that CPython supports?) Maybe compiler feature
detection could be added to ./configure and somehow plumbed through to
setup.py.
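As a rough illustration of the selection logic those three items call for, here is a minimal setup.py-style sketch. It assumes upstream's file naming (`blake3_<isa>_x86-64_unix.S`, `..._windows_gnu.S`, `..._windows_msvc.asm`), upstream's `BLAKE3_NO_*` macros for compiling SIMD paths out of blake3_dispatch.c, a hypothetical `Modules/_blake3/impl` source directory, and a hypothetical `has_avx512` flag that a configure-time compiler probe would supply. `blake3_build_config` is a name made up for this sketch, not existing CPython code.

```python
import os
import platform
import sys

# Hypothetical location of the vendored BLAKE3 sources.
BLAKE3_DIR = os.path.join("Modules", "_blake3", "impl")

# Instruction sets that have hand-written assembly upstream.
ASM_ISAS = ("sse2", "sse41", "avx2", "avx512")

# Upstream macros (assumed) that compile each SIMD path out of dispatch.
NO_SIMD_MACROS = ("BLAKE3_NO_SSE2", "BLAKE3_NO_SSE41",
                  "BLAKE3_NO_AVX2", "BLAKE3_NO_AVX512")


def blake3_build_config(machine, plat, compiler_type, has_avx512=True):
    """Return (extra_objects, define_macros) for the BLAKE3 extension.

    machine:       e.g. platform.machine()
    plat:          e.g. sys.platform
    compiler_type: "msvc" for MSVC, anything else means a GNU-style toolchain
    has_avx512:    result of a hypothetical configure-time compiler probe
    """
    if machine.lower() not in ("x86_64", "amd64"):
        # x86-32 and non-x86 targets: no assembly, restrict to portable code.
        return [], [(m, "1") for m in NO_SIMD_MACROS]

    # Pick the assembly flavor from the target OS and toolchain.
    # (Other Windows-like environments such as Cygwin are not handled here.)
    if plat == "win32":
        suffix = "_windows_msvc.asm" if compiler_type == "msvc" else "_windows_gnu.S"
    else:
        suffix = "_unix.S"

    objects, macros = [], []
    for isa in ASM_ISAS:
        if isa == "avx512" and not has_avx512:
            # Compiler/assembler too old for AVX-512: drop that path only.
            macros.append(("BLAKE3_NO_AVX512", "1"))
            continue
        objects.append(os.path.join(BLAKE3_DIR, f"blake3_{isa}_x86-64{suffix}"))
    return objects, macros


if __name__ == "__main__":
    print(blake3_build_config(platform.machine(), sys.platform, "unix"))
```

On x86-64 Linux this yields the four `*_unix.S` objects and no macros; on `i686` it yields no objects and all four `BLAKE3_NO_*` macros, which keeps blake3_dispatch.c on portable code.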
I'm hoping someone more experienced with the build system can help me
narrow down the best solution for each of those.
This also raises the higher-level question of whether the CPython
project is comfortable with including assembly files in general. As
a possible alternative, the upstream BLAKE3 project also provides
intrinsics-based implementations of the same optimizations. The upsides
of these are 1) that they don't require Unix/Windows platform detection,
2) that they support 32-bit x86 targets, and 3) that C is easier to
audit than assembly. However, the downsides of these are 1) that they're
~10% slower than the hand-written assembly, 2) that their performance is
less consistent and worse on older compilers, and 3) that they take
noticeably longer to compile. For these reasons we recommend the
assembly implementations upstream, but intrinsics are a viable option if
assembly conflicts with CPython's requirements.
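For comparison, a sketch of what the intrinsics route would involve with GCC or Clang. Each intrinsics file has to be compiled with its own instruction-set flags, and only that file: enabling, say, `-mavx2` for the whole extension would let the compiler emit AVX2 instructions in the portable code paths too, defeating runtime dispatch. That is awkward in setup.py, since distutils applies an `Extension`'s `extra_compile_args` uniformly to all of its sources. The file names and flags below are assumed to follow upstream's C README, `intrinsics_compile_commands` is a hypothetical helper, and MSVC would need different flags.

```python
# Assumed mapping from each upstream intrinsics file to the GCC/Clang
# flags it needs; double-check against the vendored version.
INTRINSICS_FLAGS = {
    "blake3_sse2.c": ["-msse2"],
    "blake3_sse41.c": ["-msse4.1"],
    "blake3_avx2.c": ["-mavx2"],
    "blake3_avx512.c": ["-mavx512f", "-mavx512vl"],
}


def intrinsics_compile_commands(cc="cc", cflags=("-O3", "-fPIC")):
    """Build one compile command per intrinsics file (GCC/Clang style).

    The ISA flags are applied per file, never globally, so the rest of
    the module stays runnable on CPUs without those instruction sets.
    """
    cmds = []
    for src, isa_flags in INTRINSICS_FLAGS.items():
        obj = src[:-2] + ".o"  # "blake3_sse2.c" -> "blake3_sse2.o"
        cmds.append([cc, "-c", *cflags, *isa_flags, src, "-o", obj])
    return cmds
```

The resulting objects could then be passed as `extra_objects`, the same way the assembly files are today.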