add internal _PyLong_FromUnsignedChar() function #82018

sir-sigurd · 2019-08-13T10:29:56Z

BPO	37837
Nosy	@jdemeyer, @gnprice, @sir-sigurd
PRs	bpo-37837: Add internal _PyLong_FromUnsignedChar() function #15251

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-08-13.10:29:55.777>
labels = ['interpreter-core', '3.9', 'performance']
title = 'add internal _PyLong_FromUnsignedChar() function'
updated_at = <Date 2019-08-28.05:55:19.090>
user = 'https://github.com/sir-sigurd'

bugs.python.org fields:

activity = <Date 2019-08-28.05:55:19.090>
actor = 'Greg Price'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Interpreter Core']
creation = <Date 2019-08-13.10:29:55.777>
creator = 'sir-sigurd'
dependencies = []
files = []
hgrepos = []
issue_num = 37837
keywords = ['patch']
message_count = 9.0
messages = ['349540', '349551', '350406', '350407', '350411', '350427', '350435', '350459', '350659']
nosy_count = 3.0
nosy_names = ['jdemeyer', 'Greg Price', 'sir-sigurd']
pr_nums = ['15251']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue37837'
versions = ['Python 3.9']

sir-sigurd · 2019-08-13T10:29:56Z

When compiled with default NSMALLPOSINTS, _PyLong_FromUnsignedChar() is significantly faster than other PyLong_From*():

$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 7.10 ms +- 0.02 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 4.29 ms +- 0.03 ms

Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 7.10 ms +- 0.02 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 4.29 ms +- 0.03 ms: 1.66x faster (-40%)

It's mostly useful for bytes/bytearray, but also can be used in several other places.

jdemeyer · 2019-08-13T12:51:23Z

Maybe an even better idea would be to partially inline PyLong_FromLong(). If the check for small ints in PyLong_FromLong() would be inlined, then the compiler could optimize those checks. This would benefit all users of PyLong_FromLong() without code changes.

gnprice · 2019-08-24T23:03:05Z

Hmm, I'm a bit confused because:

Your patch at new.code and new.function crashes Python iterpretter #59456 replaces a number of calls to PyLong_FromLong with calls to the new _PyLong_FromUnsignedChar.
That function, in turn, just calls PyLong_FromSize_t.
And that function begins:

PyObject *
PyLong_FromSize_t(size_t ival)
{
    PyLongObject *v;
    size_t t;
    int ndigits = 0;

if (ival \< PyLong_BASE)
    return PyLong_FromLong((long)ival);

// ...

So, it seems like after your patch we still end up calling PyLong_FromLong at each of these callsites, just after a couple more indirections than before.

Given the magic of compilers and of hardware branch prediction, it wouldn't at all surprise me for those indirections to not make anything slower... but if the measurements are coming out *faster*, then I feel like something else must be going on. ;-)

Ohhh, I see -- I bet it's that at _PyLong_FromUnsignedChar, the compiler can see that is_small_int(ival) is always true, so the whole function just turns into get_small_int. Whereas when compiling a call to PyLong_FromLong from some other file (other translation unit), it can't see that and can't make the optimization.

Two questions, then:

How do the measurements look under LTO? I wonder if with LTO the linker is able to make the same optimization that this change helps the compiler make.
Is there a particular reason to specifically call PyLong_FromSize_t? Seems like PyLong_FromLong is the natural default (and what we default to in the rest of the code), and it's what this ends up calling anyway.

gnprice · 2019-08-24T23:04:47Z

Oh also:

What compiler, and what compilation flags, are you using in your benchmarking? That seems relevant :)

gnprice · 2019-08-24T23:22:38Z

Is there a particular reason to specifically call PyLong_FromSize_t? Seems like PyLong_FromLong is the natural default (and what we default to in the rest of the code), and it's what this ends up calling anyway.

Ah I see, the patch is meant to go on top of #59397, which makes PyLong_FromSize_t apply the small-int optimization itself.

As I just suggested on #59397, I'd like to see that PR apply the small-int optimization in the more broadly-used PyLong_FromUnsignedLong... and then I think the natural thing for this new function to do is to call that.

Still quite curious how LTO does, and also curious what compiler and flags you're using in benchmarks.

sir-sigurd · 2019-08-25T05:20:44Z

$ gcc -v 2>&1 | grep 'gcc version'
gcc version 8.3.0 (Debian 8.3.0-19)

using ./configure --enable-optimizations --with-lto
$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.71 ms +- 0.09 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 6.71 ms +- 0.08 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.71 ms +- 0.09 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 6.71 ms +- 0.08 ms: 1.00x slower (+0%)

using ./configure --enable-optimizations
$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.73 ms +- 0.17 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 4.28 ms +- 0.01 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.73 ms +- 0.17 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 4.28 ms +- 0.01 ms: 1.57x faster (-36%)

gnprice · 2019-08-25T06:32:35Z

Very interesting, thanks!

It looks like with LTO enabled, this optimization has no effect at all.

This change adds significant complexity, and it seems like the hoped-for payoff is entirely in terms of performance on rather narrowly-focused microbenchmarks. In general I think that sets the bar rather high for making sure the performance gains are meaningful enough to justify the increase in complexity in the code.

In particular, I expect most anyone running Python and concerned with performance to be using LTO. (It's standard in distro builds of Python, so that covers probably most users already.) That means if the optimization doesn't do anything in the presence of LTO, it doesn't really count. ;-)

---

Now, I am surprised at the specifics of the result! I expected that LTO would probably pick up the equivalent optimization on its own, so that this change didn't have an effect. Instead, it looks like with LTO, this microbenchmark performs the same as it does without LTO *before* this change. That suggests that LTO may instead be blocking this optimization.

In that case, there may still be an opportunity: if you can work out why the change doesn't help under LTO, maybe you can find a way to make this optimization happen under LTO after all.

sir-sigurd · 2019-08-25T14:01:36Z

These last results are invalid :-)

I thought that I was checking _PyLong_FromUnsignedChar() on top of #59397, but that wasn't true. So the correct results for LTO build are:

$ python -m perf timeit -s "from collections import deque; consume = deque(maxlen=0).extend; b = bytes(2**20)" "consume(b)" --compare-to=../cpython-master/venv/bin/python
/home/sergey/tmp/cpython-master/venv/bin/python: ..................... 6.93 ms +- 0.04 ms
/home/sergey/tmp/cpython-dev/venv/bin/python: ..................... 3.96 ms +- 0.01 ms
Mean +- std dev: [/home/sergey/tmp/cpython-master/venv/bin/python] 6.93 ms +- 0.04 ms -> [/home/sergey/tmp/cpython-dev/venv/bin/python] 3.96 ms +- 0.01 ms: 1.75x faster (-43%)

But the most important thing is that using PyLong_FromUnsignedLong() instead of _PyLong_FromUnsignedChar() on top of #59397 is producing the same results: striter_next() uses small_ints[] directly. However that's not true for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think that's due to how code is profiled so I'm satisfied with these results more or less.

I'm closing existing PR and probably will close this issue soon after #59397 will be merged.

gnprice · 2019-08-28T05:55:19Z

Ah OK, that makes sense of it then :)

But the most important thing is that using PyLong_FromUnsignedLong() instead of _PyLong_FromUnsignedChar() on top of #59397 is producing the same results: striter_next() uses small_ints[] directly. However that's not true for bytearrayiter_next(): PyLong_FromUnsignedLong() is called there. I think that's due to how code is profiled so I'm satisfied with these results more or less.

Very interesting! That's a good comparison to have made, too.

I agree with your conclusions.

iritkatriel · 2022-08-26T15:40:13Z

I'm closing existing PR and probably will close this issue soon after #59397 will be merged.

@sir-sigurd Can this issue be closed?

sir-sigurd · 2022-08-27T07:40:50Z

Yes, since proposed changes were already merged in #32110 and #31867.

sir-sigurd mannequin added 3.9 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage labels Aug 13, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

iritkatriel added the pending The issue will be closed if no feedback is provided label Aug 26, 2022

kumaraditya303 closed this as completed Aug 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add internal _PyLong_FromUnsignedChar() function #82018

add internal _PyLong_FromUnsignedChar() function #82018

sir-sigurd mannequin commented Aug 13, 2019

sir-sigurd mannequin commented Aug 13, 2019

jdemeyer commented Aug 13, 2019

gnprice commented Aug 24, 2019

gnprice commented Aug 24, 2019

gnprice commented Aug 24, 2019

sir-sigurd mannequin commented Aug 25, 2019

gnprice commented Aug 25, 2019

sir-sigurd mannequin commented Aug 25, 2019

gnprice commented Aug 28, 2019

iritkatriel commented Aug 26, 2022

sir-sigurd commented Aug 27, 2022

add internal _PyLong_FromUnsignedChar() function #82018

add internal _PyLong_FromUnsignedChar() function #82018

Comments

sir-sigurd mannequin commented Aug 13, 2019

sir-sigurd mannequin commented Aug 13, 2019

jdemeyer commented Aug 13, 2019

gnprice commented Aug 24, 2019

gnprice commented Aug 24, 2019

gnprice commented Aug 24, 2019

sir-sigurd mannequin commented Aug 25, 2019

gnprice commented Aug 25, 2019

sir-sigurd mannequin commented Aug 25, 2019

gnprice commented Aug 28, 2019

iritkatriel commented Aug 26, 2022

sir-sigurd commented Aug 27, 2022