40

Has any processor ever implemented an integer square root instruction? Obviously, floating-point square root instructions are quite common, but I've never seen one specifically for integers.

One close candidate I know of is the Nintendo DS, which contained a memory-mapped integer divider/square rooter. This was helpful for 3D calculations, given that its ARM processor had no FPU or hardware divider. However, it's not a native processor instruction.

9
  • 2
    tough one; square rooting is something that you don't want to do combinatorically in a single clock cycle, or else your maximum clock rate will go dooooown, and it uses the same algorithmic components that an integer multiply and division need. This makes it a prime candidate for implementation in iterative microcode. Now, what's the difference between a CPU that implements an isqrt instruction in microcode vs software that implements a isqrt function on a CPU that has no isqrt instruction, precisely? Commented Apr 5 at 14:34
  • 3
    Non-iterative implementation for large input sizes get unhandy very quickly, and for small sizes, simple lookup tables might be quite performant. What is the minimum input size you would consider an implementation? Commented Apr 5 at 14:46
  • 1
    By "processor" do you mean microprocessor? Or do you intend to include mainframes, DSPs, etc?
    – DrSheldon
    Commented Apr 5 at 15:38
  • 1
    I can't imagine a huge demand for intsqrt. Normally, if your task needs sqrt you probably want float arithmetic. But I guess intsqrt is useful if you're doing fixed-point arithmetic, or implementing a bare-bones floating-point package.
    – PM 2Ring
    Commented Apr 6 at 4:31
  • 2
    I add the isqrt instruction only because I noticed that the bignum library had it as a primitive. It was obvious to me that it's not trivial to implement, particularly if someone were to need that for large integers. I went through the code, in that bignum library and noticed it was poor. I put fixes in to make it faster, and then due to the IKEA effect of having made it mine by working on it, I had to expose it in the language.
    – Kaz
    Commented Apr 7 at 17:56

4 Answers 4

51

Yes. The Harris RTX 2000 Forth CPU offered a multi-step square root instruction, as did its military-grade sibling, the RTX 2010. There may have been others, but that is the one I know of.

See: https://users.ece.cmu.edu/~koopman/stack_computers/sec4_5.html

3
  • 1
    Nice, do you know any more detailed description?
    – Raffzahn
    Commented Apr 5 at 12:41
  • 3
    @Raffzahn Not off the top of my head. But I do see when searching, you should specify Harris as part of the search. Otherwise the results are mostly pages relating to a series of NVidia GPU boards. As far as I can tell, there is no commonality besides the general model number.
    – RichF
    Commented Apr 5 at 14:20
  • 7
    Interesting; according to the reference manual (users.ece.cmu.edu/~koopman/forth/…), it's an iterated square root where you have to run 1 setup instruction and then 15 step instructions to get the final answer. I've seen that done for division on other processors. And of course it would be a Forth CPU that would have such an esoteric instruction, heh.
    – v-rob
    Commented Apr 5 at 18:52
42

Is 1946 retro enough?

From Wikipedia:

ENIAC used four of the accumulators (controlled by a special multiplier unit) to perform up to 385 multiplication operations per second; five of the accumulators were controlled by a special divider/square-rooter unit to perform up to 40 division operations per second or three square root operations per second.

ENIAC accumulators operated on decimal integers.

17

It's not that easy.

The most efficient method to calculate square root is to calculate inverse/reciprocal of the square root using Newton-Raphson iterations, and then multiply it with the original.

This is best known as the "Quake method" (see also Where did Fast InvSqrt() come from?). The more modern version used by contemporary CPU and GPUs are generalized into two instructions, one for estimating the initial guess (e.g. frsqrte of ARMv8), another to run the following iterations (e.g. frsqrts of ARMv8). Single-instruction version of sqrt is a micro-coded or pseudo-instruction version of these two instructions.

The prerequisite for all of this is a multiplier.

If you want to calculate FP (i)sqrt, then you need a (fast) FP multiplier, which all FPUs have.

If you want to calculate integer (i)sqrt, then you need a (fast) integer multiplier, which most CPUs don't have (historically). Otherwise it would be called a DSP.

To make it better, you need a (fast) multiplier that is twice the width of your input to have sufficient precision, which most CPUs definitely don't have until "relatively" recently (relative to RetroComputing).

And precision matters, or not?

If you look at the "Quake method" closely, you notice that one of the iterations was commented out.

There are a lot of use cases where the extreme precision isn't necessary and it'll be better to leave the choice of precision/speed trade off to programmers. isqrt was intentionally separated into fsqrte and fsqrts on ARMv8 exactly for this reason: so that the programmer can adjust the number of fsqrts for the desired speed and accuracy tradeoff.

So I don't quite agree to the statement that single instruction sqrt is very common. It's there because the IEEEE754 and the C stand math library requires it (for the flag bits and exceptions), but that doesn't mean it's frequently used.

Further reading

  • SPRC542 TI's math library for C64x DSP (8-issue VLIW CPU with two 32x32=64 multipliers). In this library _iq _IQisqrt is implemented using Newton-Raphson iterations and _iq _IQsqrt is calculated by multiplying the original with the isqrt. The source code is available on request.
  • SPRUGG9 TMS320C64x+ IQmath Library User's Guide.The user guide for SPRC542.
  • My implementation of square root using binary search, that doesn't depend on a multiplier. Only basic ALU instructions are used. It is vigorously undocumented. I have no idea what I wrote but it seems to work.

.

unsigned int usqrt(unsigned int x){
    unsigned int a=0;
    unsigned int masksq=0,mask=0;
    unsigned int mask_shift=15;
    for(masksq=1u<<(mask_shift<<1),mask=1u<<(mask_shift);
        mask!=0;
        masksq=masksq>>2,mask=mask>>1,mask_shift--){
        if(x>=masksq){
            a=mask;
            break;
        }
    }
    x-=masksq;//masksq==a*a;
    mask=mask>>1;
    mask_shift--;
    while(mask>0){
        unsigned int step=(mask<<mask_shift)+(a<<(mask_shift+1));
        if(x>=step){
            a|=mask;
            x-=step;
        }
        mask=mask>>1;
        mask_shift--;
    }
    return a;
}
13
  • 3
    What is recent for RetroComputing? The 8086 (from 1978) definitely did have an integer muliplier, even one that could multiply two 16 bit values into a 32 bit value. These instructions weren't really fast, though, and thus assembly books back then suggested to use the shift-register approach, as this was often faster than using the native imul/idiv instructions.
    – PMF
    Commented Apr 5 at 12:57
  • 4
    @PMF An integer multiply instruction doesn't say anything about a integer multiplier. Commented Apr 5 at 13:27
  • 5
    In what sense do IEEE-754 and the C standard math library require a single instruction square root? As far as I know, all IEEE-754 cares about is things like memory layout and precision, and all C cares about is semantics. Unless there's something I'm missing, both will allow for the square root to be done in several instructions, a function call, or some memory-mapped funk like the one the Nintendo DS has. Commented Apr 5 at 14:49
  • 6
    @OmarandLorraine you're right that nothing in IEEE754 requires an instruction. You're wrong about IEEE754 not defining semantics: the standard very much defines operations like multiplications, conversion to integer and square roots, along with their correct result. Commented Apr 5 at 15:37
  • 2
    Aside from being rebuted in Chris Lomont’s post, this hardly even attempts to answer the question posed by the asker, whether there are any implementations of integer square root in hardware. Commented Apr 8 at 15:47
8

The claim in another answer of the Quake trick being the most efficient has not been true for a long time, and was only true regarding low-quality results for floats on specific hardware. On pretty much every modern chip the native instructions are much, much faster, often a few clock cycles. (I'm the Chris Lomont that wrote an early, widely cited paper on the Quake trick, providing generalizaitions and an improvement that seems to have been copied everywhere, despite it being a terrible idea now).

A much quicker method, one used in hardware (with many more tricks), is to store a (non-equal spaced) table of values, use a quick method to pull two values, linear (or better) interpolate, shift base 2 exponent with any odd excess becoming a multiple by constant sqrt2, then, if needed, one iteration of methods better than Newton–Raphson.

Things like Halley's (and many others) converge quicker than Netwon–Raphson, and are often much faster depending on what time various operations take.

Approximations for square root on a fixed interval (since all the methods for computers are bounded) are often also faster, polynomial, Cheby stuff, Pade and higher versions, all can be done in software or hardware, depending on what tradeoffs you want.

If you only want integer, say 2^32, the same trick applies, do it in fixed point, and some not too hard analysis lets you bound tables very quickly. Another simple method used in hardware for integers is divide and conquer: each say 8 bits maps to a table of 256 fixed point values, instantly looked up in parallel, then 3 multiples (2 in parallel) give the 32 bit value (after a free truncate).

There's still plenty of research being done on speeding these up (e.g., https://inria.hal.science/hal-03424131), so any technique over 10 years old is most surely obsolete for any metric: speed, power consumption, die size, etc.

5
  • The quake trick is perfectly valid as a first approximation for newton method of calculating x^(-1/2). It also relies on the fact that both input and output values are float, and particularly that mantissa is left-aligned. It can't directly be applied to a pure integer value.
    – lvd
    Commented Apr 9 at 6:11
  • 4
    This does not answer the asked question. It looks like a comment to another answer.
    – Justme
    Commented Apr 9 at 10:59
  • This does look like missing the question asked, which is clearly about (and marked as) historic CPU, not modern transistor heaps - all of that formulated as a reply.
    – Raffzahn
    Commented Apr 9 at 11:49
  • @lvd Even there there's little use. Float sqrt on any hardware with IEEE 754 floats that I've worked on or heard of in a decade is now so fast that any quake like trick is vastly slower. For example, Pentium FSQRT was latency 70 cycles for FSQRT, and Quake. Now it's around 12. This is the same on pretty much all devices with IEEE 754 hardware now - the quake trick died well over a decade ago as any type of perf gain. It's simply much faster and vastly more accurate to simply use hardware sqrt now (especially since on many chips you can do many in parallel via NEON or SSE or ...) Commented Apr 9 at 20:47
  • There's still a possibility that internally fsqrt is still calculated via x^(-1/2) successive approximations and thus there's still a need to have 1st approximation. Not trying to say that approximation would be unconditionally 'quake trick'
    – lvd
    Commented Apr 10 at 10:19

You must log in to answer this question.

Not the answer you're looking for? Browse other questions tagged .