From 0f7a2ff4e96137dc8b55f6849574adb88234e024 Mon Sep 17 00:00:00 2001 From: Raymond Hettinger Date: Sun, 6 Sep 2020 12:21:34 -0700 Subject: [PATCH 1/5] Note performance aspects of using multiple accumulators --- Modules/mathmodule.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c index d227a5d15dca2a..b9a3029641fed8 100644 --- a/Modules/mathmodule.c +++ b/Modules/mathmodule.c @@ -2456,7 +2456,10 @@ Given that csum >= 1.0, we have: Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum. To minimize loss of information during the accumulation of fractional -values, each term has a separate accumulator. +values, each term has a separate accumulator. This also breaks up +sequential dependencies in the inner loop so the CPU can maximize +floating point throughput. On a 2.6 GHz Haswell, moving from +hypot(x,y) to hypot(x,y,z) costs only 5ns. The square root differential correction is needed because a correctly rounded square root of a correctly rounded sum of From 7a29756a9d916ee54dc741ff23e00a8d95205c74 Mon Sep 17 00:00:00 2001 From: Raymond Hettinger Date: Sun, 6 Sep 2020 13:03:54 -0700 Subject: [PATCH 2/5] Add accuracy notes --- Modules/mathmodule.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c index b9a3029641fed8..b50a9c7ad987ae 100644 --- a/Modules/mathmodule.c +++ b/Modules/mathmodule.c @@ -2477,12 +2477,22 @@ algorithm, effectively doubling the number of accurate bits. This technique is used in Dekker's SQRT2 algorithm and again in Borges' ALGORITHM 4 and 5. +Without proof for all cases, hypot() cannot claim to be always +correctly rounded. However, the accuracy of "h + x / (2.0 * h)" for +n <= 1000 is at least 100 bits prior to the final rounding. Also, +hypot() was tested against a Decimal implementation with prec=300. +After 100 million trials no incorrectly rounded examples were found. +In addition, perfect commutativity (all permutations are equal) was +verified for 1 billion random inputs with n=5. + References: 1. Veltkamp-Dekker splitting: http://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf 2. Compensated summation: http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf 3. Square root differential correction: https://arxiv.org/pdf/1904.09481.pdf 4. https://www.wolframalpha.com/input/?i=Maclaurin+series+sqrt%28h**2+%2B+x%29+at+x%3D0 +5. https://bugs.python.org/file49435/best_frac.py +6. https://bugs.python.org/file49448/test_hypot_commutativity.py */ From 971ae455a77cdfcc0a6143c00355d09fc841141f Mon Sep 17 00:00:00 2001 From: Raymond Hettinger Date: Sun, 6 Sep 2020 13:14:23 -0700 Subject: [PATCH 3/5] Show where each footnote applies --- Modules/mathmodule.c | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c index b50a9c7ad987ae..b89f0d908e24ce 100644 --- a/Modules/mathmodule.c +++ b/Modules/mathmodule.c @@ -2419,9 +2419,9 @@ To avoid overflow/underflow and to achieve high accuracy giving results that are almost always correctly rounded, four techniques are used: * lossless scaling using a power-of-two scaling factor -* accurate squaring using Veltkamp-Dekker splitting -* compensated summation using a variant of the Neumaier algorithm -* differential correction of the square root +* accurate squaring using Veltkamp-Dekker splitting [1] +* compensated summation using a variant of the Neumaier algorithm [2] +* differential correction of the square root [3] The usual presentation of the Neumaier summation algorithm has an expensive branch depending on which operand has the larger @@ -2458,7 +2458,7 @@ Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum. To minimize loss of information during the accumulation of fractional values, each term has a separate accumulator. This also breaks up sequential dependencies in the inner loop so the CPU can maximize -floating point throughput. On a 2.6 GHz Haswell, moving from +floating point throughput. [5] On a 2.6 GHz Haswell, moving from hypot(x,y) to hypot(x,y,z) costs only 5ns. The square root differential correction is needed because a @@ -2469,7 +2469,7 @@ The differential correction starts with a value *x* that is the difference between the square of *h*, the possibly inaccurately rounded square root, and the accurately computed sum of squares. The correction is the first order term of the Maclaurin series -expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2). +expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2). [4] Essentially, this differential correction is equivalent to one refinement step in Newton's divide-and-average square root @@ -2479,11 +2479,11 @@ Borges' ALGORITHM 4 and 5. Without proof for all cases, hypot() cannot claim to be always correctly rounded. However, the accuracy of "h + x / (2.0 * h)" for -n <= 1000 is at least 100 bits prior to the final rounding. Also, -hypot() was tested against a Decimal implementation with prec=300. +n <= 1000 is at least 100 bits prior to the final rounding step. [6] +Also, hypot() was tested against a Decimal implementation with prec=300. After 100 million trials no incorrectly rounded examples were found. In addition, perfect commutativity (all permutations are equal) was -verified for 1 billion random inputs with n=5. +verified for 1 billion random inputs with n=5. [7] References: @@ -2491,8 +2491,9 @@ verified for 1 billion random inputs with n=5. 2. Compensated summation: http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf 3. Square root differential correction: https://arxiv.org/pdf/1904.09481.pdf 4. https://www.wolframalpha.com/input/?i=Maclaurin+series+sqrt%28h**2+%2B+x%29+at+x%3D0 -5. https://bugs.python.org/file49435/best_frac.py -6. https://bugs.python.org/file49448/test_hypot_commutativity.py +5. https://bugs.python.org/file49439/hypot.png +6. https://bugs.python.org/file49435/best_frac.py +7. https://bugs.python.org/file49448/test_hypot_commutativity.py */ From 080c68a3748e701f65244f984a61e6eecbe0b2f7 Mon Sep 17 00:00:00 2001 From: Raymond Hettinger Date: Sun, 6 Sep 2020 13:23:57 -0700 Subject: [PATCH 4/5] Better wording for the incremental loop costs --- Modules/mathmodule.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c index b89f0d908e24ce..4703d465bb44ae 100644 --- a/Modules/mathmodule.c +++ b/Modules/mathmodule.c @@ -2458,8 +2458,9 @@ Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum. To minimize loss of information during the accumulation of fractional values, each term has a separate accumulator. This also breaks up sequential dependencies in the inner loop so the CPU can maximize -floating point throughput. [5] On a 2.6 GHz Haswell, moving from -hypot(x,y) to hypot(x,y,z) costs only 5ns. +floating point throughput. [5] On a 2.6 GHz Haswell, adding one +dimension has an incremental cost of only 5ns -- for example when +moving from hypot(x,y) to hypot(x,y,z). The square root differential correction is needed because a correctly rounded square root of a correctly rounded sum of From b6ecce695e06d536ae333f8013b6bf511f95e9b2 Mon Sep 17 00:00:00 2001 From: Raymond Hettinger Date: Sun, 6 Sep 2020 14:01:07 -0700 Subject: [PATCH 5/5] Improve wording for the accuracy paragraph --- Modules/mathmodule.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c index 4703d465bb44ae..29137ae91a22ff 100644 --- a/Modules/mathmodule.c +++ b/Modules/mathmodule.c @@ -2479,12 +2479,13 @@ This technique is used in Dekker's SQRT2 algorithm and again in Borges' ALGORITHM 4 and 5. Without proof for all cases, hypot() cannot claim to be always -correctly rounded. However, the accuracy of "h + x / (2.0 * h)" for -n <= 1000 is at least 100 bits prior to the final rounding step. [6] -Also, hypot() was tested against a Decimal implementation with prec=300. -After 100 million trials no incorrectly rounded examples were found. -In addition, perfect commutativity (all permutations are equal) was -verified for 1 billion random inputs with n=5. [7] +correctly rounded. However for n <= 1000, prior to the final addition +that rounds the overall result, the internal accuracy of "h" together +with its correction of "x / (2.0 * h)" is at least 100 bits. [6] +Also, hypot() was tested against a Decimal implementation with +prec=300. After 100 million trials, no incorrectly rounded examples +were found. In addition, perfect commutativity (all permutations are +exactly equal) was verified for 1 billion random inputs with n=5. [7] References: