Back in 2002, using this arctan formula for $\pi$ discovered by Størmer:
$$ \pi = 176 \arctan(1/57) + 28 \arctan(1/239) - 48 \arctan(1/682) \\ + 96 \arctan(1/12943)$$
and I assume using this series for arctan:
$$ \arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \dotso $$
$\pi$ was generated to 1.2 trillion digits on a supercomputer, using 64 nodes of a HITACHI SR8000/MPP, in about 157 hours (another four-term formula took about 400 hours). Link to the web site announcing the result:
https://web.archive.org/web/20150225043658/http://www.super-computing.org/pi_current.html
I assume each initial term was 480 GB, which translates to 1,030,792,151,040 hexadecimal digits, consistent with the article's stated accurate size of 1,030,700,000,000 hexadecimal digits. Since each node had 16 GB of RAM, the numbers had to be streamed on and off hard drives. The terms decrease in size with each iteration, but only a tiny fraction of the total time is spent on terms small enough to fit in a node's RAM.
Using the $\arctan$ series seems like it should be fairly straightforward: basically $term[i+1] = term[i] \cdot (1/z^2)$, where $z$ is 57, 239, 682, or 12943. Since all four series share the same $1, 3, 5, 7, \dots$ denominators, the power terms can be combined (adding and subtracting per the formula's coefficients) and the combined term divided by $(2i+1)$ once.
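As a sanity check of that scheme at toy precision, here's a minimal double-precision sketch in C; the variable names are mine, and the real computation would of course use the big fixed-point vectors described next rather than doubles:

```c
#include <stdio.h>

/* Keep one power term per arctan argument, combine them with the
   formula's coefficients, divide the combined term by (2i+1) once,
   then scale each power term by 1/z^2 for the next iteration. */
int main(void)
{
    double t[4]  = { 1.0/57, 1.0/239, 1.0/682, 1.0/12943 };
    double z2[4] = { 57.0*57, 239.0*239, 682.0*682, 12943.0*12943 };
    double c[4]  = { 176, 28, -48, 96 };
    double pi = 0.0, sign = 1.0;

    for (int i = 0; i < 12; i++) {        /* 12 iterations exceed double precision */
        double combined = 0.0;
        for (int j = 0; j < 4; j++) {
            combined += c[j] * t[j];
            t[j] /= z2[j];                /* term[i+1] = term[i] * (1/z^2) */
        }
        pi += sign * combined / (2*i + 1);
        sign = -sign;
    }
    printf("%.15f\n", pi);                /* pi correct to ~15 digits */
    return 0;
}
```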
The terms are kept as large vectors (fixed-point arrays), while the divisors are scalars (a relatively small, fixed number of bits), so the division is similar to longhand division, where a multi-digit dividend is divided by a single-digit divisor, with the quotients generated by multiply + shift to speed up the process, explained next.
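A minimal sketch of that longhand division, assuming 32-bit limbs stored most significant first (the limb width and layout are my assumptions, not necessarily what production code would use):

```c
#include <stdint.h>
#include <stdio.h>

/* Divide a fixed-point number (array of 32-bit limbs, most significant
   first) in place by a small scalar d < 2^32. The running remainder plays
   the role of the carried digit in schoolbook long division; the invariant
   rem < d guarantees each partial quotient fits in one limb. */
static void div_small(uint32_t *v, int nlimbs, uint32_t d)
{
    uint64_t rem = 0;
    for (int i = 0; i < nlimbs; i++) {
        uint64_t cur = (rem << 32) | v[i];  /* bring down the next "digit" */
        v[i] = (uint32_t)(cur / d);         /* partial quotient */
        rem  = cur % d;                     /* remainder carries forward */
    }
}

int main(void)
{
    /* 1.0 in fixed point: limb 0 is the integer part, limbs 1..4 the fraction */
    uint32_t v[5] = { 1, 0, 0, 0, 0 };
    div_small(v, 5, 57);                    /* v = 1/57 */
    printf("1/57 = 0x%08x.", v[0]);
    for (int i = 1; i < 5; i++) printf("%08x", v[i]);
    printf("\n");
    return 0;
}
```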
Dividing by a constant can be sped up by multiplying by a "magic" number and right-shifting some number of bits. For the sequence of divisors 1, 3, 5, 7, ..., the "magic" numbers are generated dynamically (each requires two actual divides), which is worthwhile because the same divisor is used across an entire large vector term.
A "magic" number $M$ is in the range: $$\frac{2^{N+L}}{divisor} \leq M \leq \frac{2^{N+L} + 2^L}{divisor}$$ where $L$ is $\big \lceil \log_2{divisor} \big \rceil$, and $N$ is set to produce the required precision. If interested, there's a SO Q&A about this:
https://stackoverflow.com/questions/41183935
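Here's a sketch of that idea for 32-bit dividends ($N = 32$), using GCC/Clang's `__uint128_t` for the wide multiply. Note this construction builds $M$ with one real divide, by taking the floor of the upper bound; constructions that round the lower bound up and then verify it use two divides, which may be what the two divides mentioned above refer to:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t M; int shift; } magic_t;

/* Build a magic number for divisor d (2 <= d < 2^32) per the bounds above:
   any M in [2^(32+L)/d, (2^(32+L)+2^L)/d] works for all 32-bit dividends,
   and since the interval is at least 1 wide, the floor of its upper
   endpoint always lands inside it. */
static magic_t make_magic(uint32_t d)
{
    magic_t m;
    int L = 0;
    while (((uint64_t)1 << L) < d) L++;   /* L = ceil(log2(d)) */
    m.shift = 32 + L;
    m.M = (uint64_t)((((__uint128_t)1 << m.shift) + ((__uint128_t)1 << L)) / d);
    return m;
}

/* q = x / d computed as a multiply and a shift */
static uint32_t magic_div(uint32_t x, magic_t m)
{
    return (uint32_t)(((__uint128_t)x * m.M) >> m.shift);
}

int main(void)
{
    /* spot-check the series divisors 3, 5, 7, ... against real division */
    for (uint32_t d = 3; d <= 9999; d += 2) {
        magic_t m = make_magic(d);
        uint32_t xs[] = { 0, 1, d - 1, d, 123456789u, 0xFFFFFFFFu };
        for (int i = 0; i < 6; i++)
            if (magic_div(xs[i], m) != xs[i] / d)
                printf("mismatch: d=%u x=%u\n", d, xs[i]);
    }
    printf("done\n");
    return 0;
}
```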
Using this method, my system (Intel 3770K at 3.5 GHz, Win 7 Pro 64-bit, assembly code) is able to generate 1 million digits (in binary) in about 140 seconds. Update: using 6 threads on the 4-core 3770K to overlap operations, this time was reduced to 48 seconds. With more cores and threads, the series could be split into separate components, in addition to overlapping operations.
However, the team effort that produced 1.2 trillion digits back in 2002 was much faster than the straightforward method I describe, and involved much more complex code, over 75,000 lines of it. I've tried searching to get an idea of what was involved, but haven't had any luck. One issue is that much faster Chudnovsky-based formulas (also simpler to implement using a big-integer library) have been used since then:
http://en.wikipedia.org/wiki/Chudnovsky_algorithm
in which case optimized implementations can generate 1 million digits of $\pi$ in about 8 seconds on my system (apparently without using parallel calculation or multithreading).
I can find articles about optimized implementations of Chudnovsky-like formulas, but I haven't found anything similar for Machin-like formulas based on arctan. I'm wondering if anyone here knows of a web site that explains how arctan-based formulas could be optimized.
Update: there were 4 terms and 64 nodes, each with a lot of memory, to compute the terms in parallel. Consider the $(1/57)$ term being calculated using just 4 of the nodes. The initial phase: node 0: $(1/57^2)$, node 1: $(1/57^4)$, node 2: $(1/57^6)$, node 3: $(1/57^8)$. After this initial phase, each node multiplies its term by $(1/57^8)$, picking up $\approx 46.7$ bits per iteration.
If all 64 nodes computed the $(1/57)$ term, each node would multiply its term by $(1/57^{128})$, picking up $\approx 746.6$ bits per iteration, which translates into $\approx 4.45$ billion iterations per node for 1 trillion digits. $57^{128}$ would fit in a 12 x 64-bit-word (96-byte) divisor. Karatsuba could be used for the extended-precision multiply / shift ("magic" numbers) to produce quotients from such divisors.
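To make the partitioning concrete, here's a scaled-down double-precision sketch in which $K$ "nodes" each take every $K$-th term of $\arctan(1/57)$, striding by $(1/57^{2K})$; at full scale each "term" would be a trillion-digit fixed-point vector rather than a double, and each pass of the outer loop would run on its own node:

```c
#include <math.h>
#include <stdio.h>

#define K 4   /* number of "nodes" sharing the arctan(1/57) term */

int main(void)
{
    double z = 57.0, stride = 1.0, sum = 0.0;
    for (int i = 0; i < 2 * K; i++) stride /= z;     /* stride = 1/57^(2K) */

    for (int j = 0; j < K; j++) {                    /* each pass = one node */
        double term = 1.0 / z;
        for (int i = 0; i < 2 * j; i++) term /= z;   /* start at 1/57^(2j+1) */
        double partial = 0.0;
        for (int i = j; term != 0.0; i += K) {       /* terms j, j+K, j+2K, ... */
            double sign = (i & 1) ? -1.0 : 1.0;      /* alternating series signs */
            partial += sign * term / (2 * i + 1);
            term *= stride;                          /* advance K terms at once */
        }
        sum += partial;                              /* combine per-node sums */
    }
    printf("%.15f\n%.15f\n", sum, atan(1.0 / 57));   /* the two should match */
    return 0;
}
```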
The team probably analyzed the computing load for each of the 4 terms and distributed the generation accordingly. Still, this doesn't sound like 75,000+ lines of code to implement, so there must have been more to it.