
The task I'm facing is to shuffle one __m128 vector and store the result in another one.

The way I see it, there are two basic ways to shuffle a packed floating-point __m128 vector:

  • _mm_shuffle_ps, which uses the SHUFPS instruction. That is not necessarily the best option if you want values from one vector only: it also takes two values from the destination operand, which implies an extra move.
  • _mm_shuffle_epi32, which uses the PSHUFD instruction. That seems to do exactly what is expected here and can have better latency/throughput than SHUFPS.

The latter intrinsic, however, works with integer vectors (__m128i), and there seems to be no floating-point counterpart, so using it with __m128 would require some ugly explicit casting. Also, the fact that there is no such counterpart probably means there is a proper reason for it that I am not aware of.

The question is: why is there no intrinsic to shuffle one floating-point vector and store the result in another?
If _mm_shuffle_ps(x, x, ...) can generate PSHUFD, can that be guaranteed?
If PSHUFD should not be used for floating point values, what is the reason for that?

Thank you!

  • There seems to be some mismatch between the title and the rest of the question; btw, _mm_shuffle_pd does exist.
    – user555045
    Commented Apr 19, 2017 at 12:21
  • What's wrong with doing __m128 y = _mm_shuffle_ps(x, x, shuf_mask);? Shuffles are very fast; there's no performance gain to be had from them taking only one input. If the look of the code bothers you, you can write an inline wrapper function or macro. AVX introduced _mm_permute_ps(), which takes one input as you're looking for.
    – Jason R
    Commented Apr 19, 2017 at 12:31
  • I've never seen a compiler generate the PSHUFD instruction from a _mm_shuffle_ps() call. Can you provide an example? Also, according to Intel's intrinsics guide, the two instructions have the same throughput and latency on all recent architectures (barring any bypass delays from moving between FP and integer domains).
    – Jason R
    Commented Apr 19, 2017 at 12:33
  • Why do you want PSHUFD? You haven't cited any verifiable reason for why you believe it's better. It's actually likely to be slower due to domain crossing in the SIMD unit.
    – Jason R
    Commented Apr 19, 2017 at 12:36
  • On affected processors, the "domain crossing" problem is a way bigger deal, performance-wise, than any infinitesimal penalty you'd see from taking two values from the output register. If you want to avoid any sort of penalty, just use registers for all of the operands. It seems you are trying to solve an invented problem. What measurements are you using that are telling you PSHUFD is faster than SHUFPS? As for the reference you asked for, the definitive one is Agner Fog's online resources. See pp. 112 & 129 of agner.org/optimize/microarchitecture.pdf. Commented Apr 19, 2017 at 13:02

1 Answer


Intrinsics are supposed to map one-to-one with instructions. It would be very undesirable for _mm_shuffle_ps to generate PSHUFD. It should always generate SHUFPS. The documentation does not suggest that there is a case where it would do otherwise.

There is a performance penalty on certain processors when data is cast to single- or double-precision floating-point. This is because the processor augments the SSE registers with internal registers containing the FP classification of the data, e.g. zero or NaN or infinity or normal. When switching types you incur a stall as it performs that step. I don't know if this is still true of modern processors, but you can consult the Intel Architecture Optimization manuals for that information.

SHUFPS is not significantly slower than PSHUFD on modern processors. According to Agner Fog's instruction tables (http://www.agner.org/optimize/instruction_tables.pdf), they have identical latency and throughput on Haswell (4th gen. Core i7). On Nehalem (1st gen. Core i7), they have identical latency, but PSHUFD has a throughput of 2/cycle and SHUFPS has a throughput of 1/cycle. So, you cannot say that one instruction should be preferred over the other across all processors, even if you ignore the performance penalty associated with switching types.

There is also a way to cast between __m128, __m128d, and __m128i: _mm_castXX_YY (https://software.intel.com/en-us/node/695375?language=es), where XX and YY are each one of ps, pd, or si128, e.g. _mm_castps_pd(). Using these casts to smuggle float data through PSHUFD is a bad idea, though, because the processors on which PSHUFD is faster also suffer the penalty for switching back to floating-point afterward. In other words, there is no faster way to do a SHUFPS than doing a SHUFPS.

  • The benefit to pshufd is that it's a copy-and-shuffle. If the original input is still needed later, pshufd avoids a movaps instruction to copy the register for shufps to modify in place. The actual reason not to use it isn't a stall of any pipeline; it's bypass latency between the SIMD-integer and FP forwarding networks: 2 cycles each way on Nehalem, non-existent for shuffles on Sandybridge-family. (Not FP format info; you're mixing that up with an AMD-specific float-vs-double penalty, and with the 1-cycle-lower latency when an add/mul/FMA instruction consumes the result of the same unit.) Commented Jan 12, 2021 at 15:51
  • See Agner Fog's microarch pdf for details on bypass latency, and on that AMD effect. Commented Jan 12, 2021 at 15:52
