
Why are there separate floating-point registers xmm0-xmm15 in Intel x64 CPUs?

I know the XMM registers are also used for vector operations, where one instruction (SSE*) operates on several numbers packed into a single register. Other than that, why should one use the xmm0-xmm15 registers instead of the general-purpose ones (rax, rbx, rcx, rdx, rbp, rsp, rdi, rsi, r8-r15)?

1 Answer


Some generic pokes at an answer:

  1. You put your finger right on it: the XMM registers give you access to vector operations, e.g., the various versions of SSE, AVX (whose 256-bit YMM registers extend the XMM registers), etc. Vector operations make it possible to do some VERY sophisticated processing. Pixels in an image, for example, generally contain several related color channels (red, green, blue, and alpha, say), and operating on several channels or several pixels with a single instruction can net you huge increases in performance. Since processors are getting more parallel today rather than getting faster clock speeds (22-core Xeon processor, anyone?), doing more CPU-intensive work on more data in parallel is a Good Thing. Doing more computation per byte fetched from memory also makes it easier to keep more cores working on a job at the same time when memory bandwidth is limited.

  2. Unless your code consists only of tiny, modular functions (and we all have to face big, complex logic flows sometimes), more registers are better for efficiency. Ideally the most heavily executed stretches of code should run with minimal RAM accesses, so being able to keep all the important variables in registers, with the 16 XMM registers available alongside the 16 general-purpose ones, is a good thing.
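To make the pixel example in point 1 concrete, here is a minimal sketch in C using SSE2 intrinsics from `<emmintrin.h>`. It assumes x86-64 (where SSE2 is always available) and tightly packed 8-bit RGBA pixels in R, G, B, A byte order; the function name `brighten_rgba` and its parameters are made up for illustration. One `_mm_adds_epu8` (PADDUSB) brightens four pixels at once, saturating at 255 instead of wrapping around:

```c
#include <emmintrin.h> /* SSE2 intrinsics (baseline on every x86-64 CPU) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: brighten packed 8-bit RGBA pixels (byte order R, G, B, A)
 * by `amount`, saturating at 255 and leaving alpha untouched. */
void brighten_rgba(uint8_t *pixels, size_t n_pixels, uint8_t amount)
{
    /* One pixel per 32-bit lane. x86 is little-endian, so R is the lowest byte:
     * 0x00010101 * amount puts `amount` in R, G and B and 0 in A. */
    const __m128i add = _mm_set1_epi32((int)(0x00010101u * amount));

    size_t i = 0;
    /* Four pixels (16 bytes) per iteration with one saturating byte add. */
    for (; i + 4 <= n_pixels; i += 4) {
        __m128i px = _mm_loadu_si128((const __m128i *)(pixels + 4 * i));
        px = _mm_adds_epu8(px, add);
        _mm_storeu_si128((__m128i *)(pixels + 4 * i), px);
    }

    /* Scalar tail for the remaining 0-3 pixels, same saturating rule. */
    for (; i < n_pixels; ++i) {
        for (size_t c = 0; c < 3; ++c) {
            unsigned v = (unsigned)pixels[4 * i + c] + amount;
            pixels[4 * i + c] = (uint8_t)(v > 255u ? 255u : v);
        }
    }
}
```

A call like `brighten_rgba(image, width * height, 16)` does the bulk of the work four pixels per vector instruction; the scalar loop only mops up the leftovers.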

Note that vector operations aren't only for floating point; there are many integer operations where SSE instructions can be beneficial. For example, you can write highly efficient memory moves that transfer a cache line's worth of data with a few 128-bit loads and stores on any processor built in the last 15 years or so (i.e., requiring only SSE2, which every x86-64 CPU supports).
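A rough sketch of the SSE2 memory-move idea, assuming non-overlapping buffers and using only the unaligned load/store intrinsics `_mm_loadu_si128` and `_mm_storeu_si128`. `copy_sse2` is a hypothetical name, and this is an illustration rather than a replacement for your C library's `memcpy`, which typically also deals with alignment, prefetching, and very large copies:

```c
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_storeu_si128 */
#include <stddef.h>

/* Hypothetical copy routine: moves 64 bytes (the size of a typical x86 cache
 * line) per iteration with four 128-bit loads and four 128-bit stores.
 * Assumes dst and src do not overlap. */
void copy_sse2(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + 64 <= n; i += 64) {
        /* Load the whole 64-byte block into four XMM registers, then store it. */
        __m128i a = _mm_loadu_si128((const __m128i *)(s + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(s + i + 16));
        __m128i c = _mm_loadu_si128((const __m128i *)(s + i + 32));
        __m128i e = _mm_loadu_si128((const __m128i *)(s + i + 48));
        _mm_storeu_si128((__m128i *)(d + i), a);
        _mm_storeu_si128((__m128i *)(d + i + 16), b);
        _mm_storeu_si128((__m128i *)(d + i + 32), c);
        _mm_storeu_si128((__m128i *)(d + i + 48), e);
    }

    /* Copy the remaining 0-63 bytes one at a time. */
    for (; i < n; ++i)
        d[i] = s[i];
}
```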

Last but not least, on the negative side, getting data into and out of the XMM registers can be a bit of a challenge: moving a value between a general-purpose register and an XMM register, or pulling one element out of a vector, takes extra instructions. Unless you've planned the system design carefully, switching back and forth between using registers "the old way" (scalar, without vectors) and using vectors can be inefficient.
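To show what that in/out traffic looks like, here is a hedged sketch using SSE2 only; `pack4` and `hsum_epi32` are hypothetical helper names. Packing four scalars into one register and reducing a vector back to a single `int` in a general-purpose register both cost several instructions, overhead that plain scalar code never pays:

```c
#include <emmintrin.h> /* SSE2 */

/* Getting data in: four ints from general-purpose registers into one XMM
 * register. Compilers typically emit a few MOVD plus unpack instructions. */
__m128i pack4(int a, int b, int c, int d)
{
    return _mm_setr_epi32(a, b, c, d); /* lane 0 = a, ..., lane 3 = d */
}

/* Getting data out: sum the four 32-bit lanes into one int. At the intrinsic
 * level this is two shuffles (PSHUFD), two adds (PADDD) and a MOVD. */
int hsum_epi32(__m128i v)
{
    __m128i t = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)); /* [c, d, a, b] */
    __m128i sum = _mm_add_epi32(v, t);                         /* [a+c, b+d, a+c, b+d] */
    t = _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1));       /* swap adjacent lanes */
    sum = _mm_add_epi32(sum, t);                               /* every lane = a+b+c+d */
    return _mm_cvtsi128_si32(sum);                             /* lane 0 -> GPR */
}
```

If the work done between the pack and the reduction is a single add, the scalar version will likely win; the vector route pays off when many operations happen while the data stays in XMM registers.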

However, once you start thinking about what vector operations can do for you, some pretty cool new horizons open up. Imagine, for example, a loop object that handles multiply nested for loops, with the X and Y coordinates kept in a single vector. Just one PADDD (_mm_add_epi32) instruction, adding something like {-width, 1} to {x, y}, can increment the outer loop variable AND reset the inner loop variable to prepare for the next set of iterations.
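Here is one way the loop-object idea could look, as a sketch under a few assumptions: `grid_loop` and its functions are hypothetical names, lane 0 of an `__m128i` holds the inner index x, and lane 1 holds the outer index y. Advancing x is a PADDD with {1, 0, 0, 0}; when x reaches the width, a single PADDD with {-width, 1, 0, 0} resets x to 0 and increments y at the same time:

```c
#include <emmintrin.h> /* SSE2 */

/* Hypothetical 2-D loop counter; lanes 2 and 3 of each vector are unused. */
typedef struct {
    __m128i xy;       /* lane 0 = x (inner index), lane 1 = y (outer index) */
    __m128i step_x;   /* {1, 0, 0, 0} */
    __m128i next_row; /* {-width, 1, 0, 0} */
    int width, height;
} grid_loop;

grid_loop grid_loop_init(int width, int height)
{
    grid_loop l;
    l.xy = _mm_setzero_si128();                   /* x = 0, y = 0 */
    l.step_x = _mm_setr_epi32(1, 0, 0, 0);        /* x += 1 */
    l.next_row = _mm_setr_epi32(-width, 1, 0, 0); /* x -= width, y += 1 */
    l.width = width;
    l.height = height;
    return l;
}

int grid_loop_x(const grid_loop *l) { return _mm_cvtsi128_si32(l->xy); }

/* Shift lane 1 down to lane 0 before moving it to a general-purpose register. */
int grid_loop_y(const grid_loop *l) { return _mm_cvtsi128_si32(_mm_srli_si128(l->xy, 4)); }

/* Advance to the next (x, y) in row-major order; returns 0 when done. */
int grid_loop_next(grid_loop *l)
{
    l->xy = _mm_add_epi32(l->xy, l->step_x);
    if (grid_loop_x(l) == l->width)
        l->xy = _mm_add_epi32(l->xy, l->next_row); /* one PADDD: x = 0 and y += 1 */
    return grid_loop_y(l) < l->height;
}
```

Typical use: `grid_loop l = grid_loop_init(w, h); if (w > 0 && h > 0) do { /* use grid_loop_x(&l), grid_loop_y(&l) */ } while (grid_loop_next(&l));`. Reading x and y back out still costs a MOVD (plus a byte shift for y), which is exactly the in/out overhead discussed above.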

A handy reference:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

  • Thanks for the detailed answer; accepted, though I don't have enough reputation to vote. The link is also very useful.
    – anon
    Commented Sep 6, 2016 at 10:54
