
The _mm_shuffle_ps() intrinsic selects two floats from its first operand for the low two lanes of the output and two floats from its second operand for the high two lanes.

For example:

R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3,2,3,2));

will result in:

R[0] = L1[2];
R[1] = L1[3];
R[2] = H1[2];
R[3] = H1[3];
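The lane mapping above can be checked with a small sketch. The helper name take_high_pairs and the sample values are only illustrative assumptions, not part of any library:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _mm_shuffle_ps, _mm_storeu_ps */

/* Hypothetical helper: writes { l1[2], l1[3], h1[2], h1[3] } to out. */
static void take_high_pairs(const float l1[4], const float h1[4], float out[4])
{
    __m128 L1 = _mm_loadu_ps(l1);
    __m128 H1 = _mm_loadu_ps(h1);
    /* _MM_SHUFFLE(z, y, x, w):
       out[0] = L1[w], out[1] = L1[x], out[2] = H1[y], out[3] = H1[z] */
    __m128 R = _mm_shuffle_ps(L1, H1, _MM_SHUFFLE(3, 2, 3, 2));
    _mm_storeu_ps(out, R);
}
```

Note that the two low selectors of the mask always index the first operand and the two high selectors always index the second one.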

Is there a similar intrinsic for integer data? Something that takes two __m128i variables and a mask, and combines them the same way?

The _mm_shuffle_epi32() intrinsic takes just one 128-bit vector instead of two.

  • Depends on the size of the elements. If you need 32-bit ints, just use _mm_shuffle_ps, this will work on ints too. Commented Oct 31, 2012 at 8:26
  • So I should just cast __m128i to __m128? Let me see if that works. Commented Oct 31, 2012 at 15:22

1 Answer


Nope, there is no integer equivalent to this. So you have to either emulate it, or cheat.

One method is to use _mm_shuffle_epi32() on both A and B, mask off the unwanted elements of each, and OR the two results together.

That tends to be messy and takes around 5 instructions (or 3 if you use the SSE4.1 blend instructions).
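As a rough sketch of the SSE2-only route (shuffle both inputs, mask, then OR), assuming the same selection pattern as in the question. The function name shuffle2_epi32_sse2 is made up for illustration:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 integer intrinsics */

/* Hypothetical helper: returns { a[2], a[3], b[2], b[3] } using SSE2 only. */
static __m128i shuffle2_epi32_sse2(__m128i a, __m128i b)
{
    /* Each vector becomes { x[2], x[3], x[2], x[3] }. */
    a = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 2, 3, 2));
    b = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 2, 3, 2));

    /* All ones in lanes 2 and 3 (the upper 64 bits), zeros in lanes 0 and 1. */
    const __m128i hi_mask = _mm_set_epi32(-1, -1, 0, 0);

    a = _mm_andnot_si128(hi_mask, a);  /* (~hi_mask) & a: keep the low lanes of a */
    b = _mm_and_si128(hi_mask, b);     /* hi_mask & b: keep the high lanes of b */
    return _mm_or_si128(a, b);
}
```

That is two shuffles, two masking operations and one OR, which matches the "around 5 instructions" estimate if the mask constant is not counted.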

Here's the SSE4.1 solution with 3 instructions:

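// _mm_set_epi32 lists elements from high to low: A = {10,11,12,13}, B = {20,21,22,23}
__m128i A = _mm_set_epi32(13,12,11,10);
__m128i B = _mm_set_epi32(23,22,21,20);

// 2*1 + 3*4 + 2*16 + 3*64 is the same value as _MM_SHUFFLE(3,2,3,2) (0xEE).
// Each vector becomes {x[2], x[3], x[2], x[3]}.
A = _mm_shuffle_epi32(A,2*1 + 3*4 + 2*16 + 3*64);
B = _mm_shuffle_epi32(B,2*1 + 3*4 + 2*16 + 3*64);

// 0xf0 takes the low four 16-bit words from A and the high four from B:
// C = {12,13,22,23}
__m128i C = _mm_blend_epi16(A,B,0xf0);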

The method that I prefer is to actually cheat - and use the floating-point shuffle like this:

__m128i Ai,Bi,Ci;
__m128  Af,Bf,Cf;

Af = _mm_castsi128_ps(Ai);
Bf = _mm_castsi128_ps(Bi);
Cf = _mm_shuffle_ps(Af,Bf,_MM_SHUFFLE(3,2,3,2));
Ci = _mm_castps_si128(Cf);

This reinterprets the data as floating-point so that the float shuffle can be used, then reinterprets the result as integer again.

Note that these "conversions" are bitwise conversions (aka reinterpretations). No conversion is actually done and they don't map to any instructions. In the assembly, there is no distinction between an integer or a floating-point SSE register. These cast intrinsics are just to get around the type-safety imposed by C/C++.
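To illustrate that the casts leave the bits untouched, here is a minimal sketch that wraps the cast-and-shuffle sequence in a function. The name shuffle2_epi32_cast is hypothetical, and the selection pattern is fixed to the one from the question:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_castsi128_ps, _mm_castps_si128 */

/* Hypothetical helper: returns { a[2], a[3], b[2], b[3] } via the float shuffle. */
static __m128i shuffle2_epi32_cast(__m128i a, __m128i b)
{
    __m128 af = _mm_castsi128_ps(a);  /* reinterpretation only, no value conversion */
    __m128 bf = _mm_castsi128_ps(b);
    __m128 cf = _mm_shuffle_ps(af, bf, _MM_SHUFFLE(3, 2, 3, 2));
    return _mm_castps_si128(cf);      /* reinterpret the result as integers again */
}
```

Because the shuffle only moves bits around, integer values whose bit patterns would be NaN as floats (for example -1, which is 0xFFFFFFFF) should come out unchanged.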

However, be aware that on some CPUs this approach can incur extra latency (a bypass delay) for moving data back and forth between the integer and floating-point SIMD execution domains. In that case it will be more expensive than the shuffle instruction alone.

  • That's pretty much what I was about to post, but it took me longer.
    – user555045
    Commented Oct 31, 2012 at 8:26
  • Try -flax-vector-conversions Commented Oct 31, 2012 at 8:28
  • I wonder how this compares with not switching domains and doing _mm_shuffle_epi32(_mm_unpackhi_epi32(Ai,Bi), 0xd8)?
    – Z boson
    Commented Nov 18, 2014 at 14:42
  • @Zboson I never actually tried that. I can't say I actually need such a shuffle for integers anymore - since I've always been able to find a better data layout that had other benefits.
    – Mysticial
    Commented Nov 18, 2014 at 18:57
  • @Zboson: There is no extra bypass delay for using FP shuffles on integer data. (On some CPUs, the reverse is not true. On other CPUs, e.g. AMD, even FP shuffles happen in the ivec domain and impose a bypass delay for addps / shufps / addps.) The same shuffle hardware handles FP and int shuffling; it's just a matter of wiring. Apparently it's possible for HW designers to still make the result of FP shuffles available on the integer forwarding network as well as the FP forwarding network. Commented Feb 6, 2016 at 9:58
