From 0f7a2ff4e96137dc8b55f6849574adb88234e024 Mon Sep 17 00:00:00 2001
From: Raymond Hettinger <raymond.hettinger@gmail.com>
Date: Sun, 6 Sep 2020 12:21:34 -0700
Subject: [PATCH 1/5] Note performance aspects of using multiple accumulators

---
 Modules/mathmodule.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index d227a5d15dca2a..b9a3029641fed8 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2456,7 +2456,10 @@ Given that csum >= 1.0, we have:
 Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum.
 
 To minimize loss of information during the accumulation of fractional
-values, each term has a separate accumulator.
+values, each term has a separate accumulator.  This also breaks up
+sequential dependencies in the inner loop so the CPU can maximize
+floating point throughput.  On a 2.6 GHz Haswell, moving from
+hypot(x,y) to hypot(x,y,z) costs only 5ns.
 
 The square root differential correction is needed because a
 correctly rounded square root of a correctly rounded sum of

From 7a29756a9d916ee54dc741ff23e00a8d95205c74 Mon Sep 17 00:00:00 2001
From: Raymond Hettinger <raymond.hettinger@gmail.com>
Date: Sun, 6 Sep 2020 13:03:54 -0700
Subject: [PATCH 2/5] Add accuracy notes

---
 Modules/mathmodule.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index b9a3029641fed8..b50a9c7ad987ae 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2477,12 +2477,22 @@ algorithm, effectively doubling the number of accurate bits.
 This technique is used in Dekker's SQRT2 algorithm and again in
 Borges' ALGORITHM 4 and 5.
 
+Without proof for all cases, hypot() cannot claim to be always
+correctly rounded.  However, the accuracy of "h + x / (2.0 * h)" for
+n <= 1000 is at least 100 bits prior to the final rounding.  Also,
+hypot() was tested against a Decimal implementation with prec=300.
+After 100 million trials no incorrectly rounded examples were found.
+In addition, perfect commutativity (all permutations are equal) was
+verified for 1 billion random inputs with n=5.
+
 References:
 
 1. Veltkamp-Dekker splitting: http://csclub.uwaterloo.ca/~pbarfuss/dekker1971.pdf
 2. Compensated summation:  http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf
 3. Square root differential correction:  https://arxiv.org/pdf/1904.09481.pdf
 4. https://www.wolframalpha.com/input/?i=Maclaurin+series+sqrt%28h**2+%2B+x%29+at+x%3D0
+5. https://bugs.python.org/file49435/best_frac.py
+6. https://bugs.python.org/file49448/test_hypot_commutativity.py
 
 */
 

From 971ae455a77cdfcc0a6143c00355d09fc841141f Mon Sep 17 00:00:00 2001
From: Raymond Hettinger <raymond.hettinger@gmail.com>
Date: Sun, 6 Sep 2020 13:14:23 -0700
Subject: [PATCH 3/5] Show where each footnote applies

---
 Modules/mathmodule.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index b50a9c7ad987ae..b89f0d908e24ce 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2419,9 +2419,9 @@ To avoid overflow/underflow and to achieve high accuracy giving results
 that are almost always correctly rounded, four techniques are used:
 
 * lossless scaling using a power-of-two scaling factor
-* accurate squaring using Veltkamp-Dekker splitting
-* compensated summation using a variant of the Neumaier algorithm
-* differential correction of the square root
+* accurate squaring using Veltkamp-Dekker splitting [1]
+* compensated summation using a variant of the Neumaier algorithm [2]
+* differential correction of the square root [3]
 
 The usual presentation of the Neumaier summation algorithm has an
 expensive branch depending on which operand has the larger
@@ -2458,7 +2458,7 @@ Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum.
 To minimize loss of information during the accumulation of fractional
 values, each term has a separate accumulator.  This also breaks up
 sequential dependencies in the inner loop so the CPU can maximize
-floating point throughput.  On a 2.6 GHz Haswell, moving from
+floating point throughput. [5]  On a 2.6 GHz Haswell, moving from
 hypot(x,y) to hypot(x,y,z) costs only 5ns.
 
 The square root differential correction is needed because a
@@ -2469,7 +2469,7 @@ The differential correction starts with a value *x* that is
 the difference between the square of *h*, the possibly inaccurately
 rounded square root, and the accurately computed sum of squares.
 The correction is the first order term of the Maclaurin series
-expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2).
+expansion of sqrt(h**2 + x) == h + x/(2*h) + O(x**2). [4]
 
 Essentially, this differential correction is equivalent to one
 refinement step in Newton's divide-and-average square root
@@ -2479,11 +2479,11 @@ Borges' ALGORITHM 4 and 5.
 
 Without proof for all cases, hypot() cannot claim to be always
 correctly rounded.  However, the accuracy of "h + x / (2.0 * h)" for
-n <= 1000 is at least 100 bits prior to the final rounding.  Also,
-hypot() was tested against a Decimal implementation with prec=300.
+n <= 1000 is at least 100 bits prior to the final rounding step.  [6]
+Also, hypot() was tested against a Decimal implementation with prec=300.
 After 100 million trials no incorrectly rounded examples were found.
 In addition, perfect commutativity (all permutations are equal) was
-verified for 1 billion random inputs with n=5.
+verified for 1 billion random inputs with n=5. [7]
 
 References:
 
@@ -2491,8 +2491,9 @@ verified for 1 billion random inputs with n=5.
 2. Compensated summation:  http://www.ti3.tu-harburg.de/paper/rump/Ru08b.pdf
 3. Square root differential correction:  https://arxiv.org/pdf/1904.09481.pdf
 4. https://www.wolframalpha.com/input/?i=Maclaurin+series+sqrt%28h**2+%2B+x%29+at+x%3D0
-5. https://bugs.python.org/file49435/best_frac.py
-6. https://bugs.python.org/file49448/test_hypot_commutativity.py
+5. https://bugs.python.org/file49439/hypot.png
+6. https://bugs.python.org/file49435/best_frac.py
+7. https://bugs.python.org/file49448/test_hypot_commutativity.py
 
 */
 

From 080c68a3748e701f65244f984a61e6eecbe0b2f7 Mon Sep 17 00:00:00 2001
From: Raymond Hettinger <raymond.hettinger@gmail.com>
Date: Sun, 6 Sep 2020 13:23:57 -0700
Subject: [PATCH 4/5] Better wording for the incremental loop costs

---
 Modules/mathmodule.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index b89f0d908e24ce..4703d465bb44ae 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2458,8 +2458,9 @@ Since lo**2 is less than 1/2 ulp(csum), we have csum+lo*lo == csum.
 To minimize loss of information during the accumulation of fractional
 values, each term has a separate accumulator.  This also breaks up
 sequential dependencies in the inner loop so the CPU can maximize
-floating point throughput. [5]  On a 2.6 GHz Haswell, moving from
-hypot(x,y) to hypot(x,y,z) costs only 5ns.
+floating point throughput. [5] On a 2.6 GHz Haswell, adding one
+dimension has an incremental cost of only 5ns -- for example when
+moving from hypot(x,y) to hypot(x,y,z).
 
 The square root differential correction is needed because a
 correctly rounded square root of a correctly rounded sum of

From b6ecce695e06d536ae333f8013b6bf511f95e9b2 Mon Sep 17 00:00:00 2001
From: Raymond Hettinger <raymond.hettinger@gmail.com>
Date: Sun, 6 Sep 2020 14:01:07 -0700
Subject: [PATCH 5/5] Improve wording for the accuracy paragraph

---
 Modules/mathmodule.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/Modules/mathmodule.c b/Modules/mathmodule.c
index 4703d465bb44ae..29137ae91a22ff 100644
--- a/Modules/mathmodule.c
+++ b/Modules/mathmodule.c
@@ -2479,12 +2479,13 @@ This technique is used in Dekker's SQRT2 algorithm and again in
 Borges' ALGORITHM 4 and 5.
 
 Without proof for all cases, hypot() cannot claim to be always
-correctly rounded.  However, the accuracy of "h + x / (2.0 * h)" for
-n <= 1000 is at least 100 bits prior to the final rounding step.  [6]
-Also, hypot() was tested against a Decimal implementation with prec=300.
-After 100 million trials no incorrectly rounded examples were found.
-In addition, perfect commutativity (all permutations are equal) was
-verified for 1 billion random inputs with n=5. [7]
+correctly rounded.  However for n <= 1000, prior to the final addition
+that rounds the overall result, the internal accuracy of "h" together
+with its correction of "x / (2.0 * h)" is at least 100 bits. [6]
+Also, hypot() was tested against a Decimal implementation with
+prec=300.  After 100 million trials, no incorrectly rounded examples
+were found.  In addition, perfect commutativity (all permutations are
+exactly equal) was verified for 1 billion random inputs with n=5. [7]
 
 References: