Back in 2002, using this arctan formula for $\pi$ discovered by Størmer:
$$ \pi = 176 \arctan(1/57) + 28 \arctan(1/239) - 48 \arctan(1/682) \\ + 96 \arctan(1/12943)$$
and I assume using this series for arctan:
$$ \arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \frac{x^9}{9} - \dotso $$
$\pi$ was generated to 1.2 trillion digits on a supercomputer, using 64 nodes of a HITACHI SR8000/MPP, in about 157 hours (another four-term formula took about 400 hours). Link to the web site announcing the result:
https://web.archive.org/web/20150225043658/http://www.super-computing.org/pi_current.html
I assume each initial term was 480 GB, which translates to 1,030,792,151,040 hexadecimal digits, consistent with the article's stated accurate size of 1,030,700,000,000 hexadecimal digits. Since each node had 16 GB of RAM, the numbers had to be streamed on and off hard drives. The terms decrease in size with each iteration, but only a tiny fraction of the total time is spent on terms small enough to fit in a node's RAM.
Using the $\arctan$ series seems like it should be fairly straightforward: basically $term[i+1] = term[i] \cdot (1/z^2)$, where $z$ is 57, 239, 682, or 12943. Since all four series share the same $1, 3, 5, 7, \dots$ denominators, the power terms can be combined (adding and subtracting per the formula's coefficients) and the combined term divided by $(2i+1)$ once.
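As a sanity check of that scheme at toy precision, here's a minimal double-precision sketch in C; the variable names are mine, and the real computation would of course use the big fixed-point vectors described next rather than doubles:

```c
#include <stdio.h>

/* Keep one power term per arctan argument, combine them with the
   formula's coefficients, divide the combined term by (2i+1) once,
   then scale each power term by 1/z^2 for the next iteration. */
int main(void)
{
    double t[4]  = { 1.0/57, 1.0/239, 1.0/682, 1.0/12943 };
    double z2[4] = { 57.0*57, 239.0*239, 682.0*682, 12943.0*12943 };
    double c[4]  = { 176, 28, -48, 96 };
    double pi = 0.0, sign = 1.0;

    for (int i = 0; i < 12; i++) {        /* 12 iterations exceed double precision */
        double combined = 0.0;
        for (int j = 0; j < 4; j++) {
            combined += c[j] * t[j];
            t[j] /= z2[j];                /* term[i+1] = term[i] * (1/z^2) */
        }
        pi += sign * combined / (2*i + 1);
        sign = -sign;
    }
    printf("%.15f\n", pi);                /* pi correct to ~15 digits */
    return 0;
}
```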
The terms are kept as large vectors (fixed-point arrays), while the divisors are scalars (a relatively small, fixed number of bits), so the division is similar to longhand division, where a multi-digit dividend is divided by a single-digit divisor, with the quotients generated by multiply + shift to speed up the process, explained next.
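A minimal sketch of that longhand division, assuming 32-bit limbs stored most significant first (the limb width and layout are my assumptions, not necessarily what production code would use):

```c
#include <stdint.h>
#include <stdio.h>

/* Divide a fixed-point number (array of 32-bit limbs, most significant
   first) in place by a small scalar d < 2^32. The running remainder plays
   the role of the carried digit in schoolbook long division; the invariant
   rem < d guarantees each partial quotient fits in one limb. */
static void div_small(uint32_t *v, int nlimbs, uint32_t d)
{
    uint64_t rem = 0;
    for (int i = 0; i < nlimbs; i++) {
        uint64_t cur = (rem << 32) | v[i];  /* bring down the next "digit" */
        v[i] = (uint32_t)(cur / d);         /* partial quotient */
        rem  = cur % d;                     /* remainder carries forward */
    }
}

int main(void)
{
    /* 1.0 in fixed point: limb 0 is the integer part, limbs 1..4 the fraction */
    uint32_t v[5] = { 1, 0, 0, 0, 0 };
    div_small(v, 5, 57);                    /* v = 1/57 */
    printf("1/57 = 0x%08x.", v[0]);
    for (int i = 1; i < 5; i++) printf("%08x", v[i]);
    printf("\n");
    return 0;
}
```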
Dividing by a constant can be sped up by multiplying by a "magic" number and right-shifting some number of bits. For the sequence of divisors 1, 3, 5, 7, ..., the "magic" numbers are generated dynamically (each requires two actual divides), which is worthwhile because the same divisor is used across an entire large vector term.
A "magic" number $M$ is in the range: $$\frac{2^{N+L}}{divisor} \leq M \leq \frac{2^{N+L} + 2^L}{divisor}$$ where $L$ is $\big \lceil \log_2{divisor} \big \rceil$, and $N$ is set to produce the required precision. If interested, there's a SO Q&A about this:
https://stackoverflow.com/questions/41183935
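Here's a sketch of that idea for 32-bit dividends ($N = 32$), using GCC/Clang's `__uint128_t` for the wide multiply. Note this construction builds $M$ with one real divide, by taking the floor of the upper bound; constructions that round the lower bound up and then verify it use two divides, which may be what the two divides mentioned above refer to:

```c
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t M; int shift; } magic_t;

/* Build a magic number for divisor d (2 <= d < 2^32) per the bounds above:
   any M in [2^(32+L)/d, (2^(32+L)+2^L)/d] works for all 32-bit dividends,
   and since the interval is at least 1 wide, the floor of its upper
   endpoint always lands inside it. */
static magic_t make_magic(uint32_t d)
{
    magic_t m;
    int L = 0;
    while (((uint64_t)1 << L) < d) L++;   /* L = ceil(log2(d)) */
    m.shift = 32 + L;
    m.M = (uint64_t)((((__uint128_t)1 << m.shift) + ((__uint128_t)1 << L)) / d);
    return m;
}

/* q = x / d computed as a multiply and a shift */
static uint32_t magic_div(uint32_t x, magic_t m)
{
    return (uint32_t)(((__uint128_t)x * m.M) >> m.shift);
}

int main(void)
{
    /* spot-check the series divisors 3, 5, 7, ... against real division */
    for (uint32_t d = 3; d <= 9999; d += 2) {
        magic_t m = make_magic(d);
        uint32_t xs[] = { 0, 1, d - 1, d, 123456789u, 0xFFFFFFFFu };
        for (int i = 0; i < 6; i++)
            if (magic_div(xs[i], m) != xs[i] / d)
                printf("mismatch: d=%u x=%u\n", d, xs[i]);
    }
    printf("done\n");
    return 0;
}
```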
Using this method, my system (Intel 3770K at 3.5 GHz, Win 7 Pro 64-bit, assembly code) is able to generate 1 million digits (in binary) in about 140 seconds. Update: using 6 threads on the 4-core 3770K to overlap operations, this time was reduced to 48 seconds. With more cores and threads, the series could be split into separate components, in addition to overlapping operations.
However, the team effort that produced 1.2 trillion digits back in 2002 was much faster than the straightforward method I describe, and involved much more complex code, over 75,000 lines of it. I've tried searching to get an idea of what was involved, but haven't had any luck. One issue is that much faster Chudnovsky-based formulas (also simpler to implement using a big-integer library) have been used since then:
http://en.wikipedia.org/wiki/Chudnovsky_algorithm
in which case optimized implementations can generate 1 million digits of $\pi$ in about 8 seconds on my system (apparently without using parallel calculation or multithreading).
I can find articles about optimized implementations of Chudnovsky-like formulas, but I haven't found anything similar for Machin-like formulas based on arctan. I'm wondering if anyone here knows of a web site that explains how arctan-based formulas could be optimized.
Update: there were 4 terms and 64 nodes, each with a lot of memory, to compute the terms in parallel. Consider the $(1/57)$ term being calculated using just 4 of the nodes. The initial phase: node 0: $(1/57^2)$, node 1: $(1/57^4)$, node 2: $(1/57^6)$, node 3: $(1/57^8)$. After this initial phase, each node multiplies its term by $(1/57^8)$, picking up $\approx 46.7$ bits per iteration.
If all 64 nodes computed the $(1/57)$ term, each node would multiply its term by $(1/57^{128})$, picking up $\approx 746.6$ bits per iteration, which translates into $\approx 4.45$ billion iterations per node for 1 trillion digits. $57^{128}$ would fit in a 12 x 64-bit-word (96-byte) divisor. Karatsuba could be used for the extended-precision multiply / shift ("magic" numbers) to produce quotients from such divisors.
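To make the partitioning concrete, here's a scaled-down double-precision sketch in which $K$ "nodes" each take every $K$-th term of $\arctan(1/57)$, striding by $(1/57^{2K})$; at full scale each "term" would be a trillion-digit fixed-point vector rather than a double, and each pass of the outer loop would run on its own node:

```c
#include <math.h>
#include <stdio.h>

#define K 4   /* number of "nodes" sharing the arctan(1/57) term */

int main(void)
{
    double z = 57.0, stride = 1.0, sum = 0.0;
    for (int i = 0; i < 2 * K; i++) stride /= z;     /* stride = 1/57^(2K) */

    for (int j = 0; j < K; j++) {                    /* each pass = one node */
        double term = 1.0 / z;
        for (int i = 0; i < 2 * j; i++) term /= z;   /* start at 1/57^(2j+1) */
        double partial = 0.0;
        for (int i = j; term != 0.0; i += K) {       /* terms j, j+K, j+2K, ... */
            double sign = (i & 1) ? -1.0 : 1.0;      /* alternating series signs */
            partial += sign * term / (2 * i + 1);
            term *= stride;                          /* advance K terms at once */
        }
        sum += partial;                              /* combine per-node sums */
    }
    printf("%.15f\n%.15f\n", sum, atan(1.0 / 57));   /* the two should match */
    return 0;
}
```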
The team probably analyzed the computing load for each of the 4 terms and distributed the generation accordingly. Still, this doesn't sound like 75,000+ lines of code to implement, so there must have been more to it.