5

With the constraint that I can use only SSE and SSE2 instructions, I have a need to replace the least significant (0) element of a 4 element __m128i vector with the 0 element from another vector.

For floating point vectors, the task is simple - one can use the _mm_move_ss() intrinsic to cause the element to be replaced with the 0 element from another vector. It generates one movss instruction, so is quite efficient.

Using two casting intrinsics, it's possible to also convince the compiler to use a single SSE movss instruction to move integer data. The source code ends up looking like this:

__m128i NewVector = _mm_castps_si128(_mm_move_ss(_mm_castsi128_ps(Take3FromThisVector),
                                                 _mm_castsi128_ps(Take1FromThisVector)));

It looks a bit messy, but with a suitable amount of commenting it can be acceptable, especially since it generates a minimum of instructions. In its typical use everything's optimized to be in xmm registers.

My question is this:

Since it's a movss instruction, where the "ss" implies single precision floating point, is it okay to have it move integer data that could potentially contain some "special" or "illegal" (for floating point) combo of bits in any of the vector positions?

The obvious alternative - which I also implemented and tested - is to AND the first vector with a mask, then OR in a second vector that contains just one value in the least significant element, with all the others being zero. As you can imagine, this generates more instructions.

I've tested the casting approach I showed above and it doesn't seem to cause any problems, but I note in particular that there's no intrinsic provided that does this same operation for integer data. It seems as though Intel would have provided one if it was just as good for integer data - e.g., _mm_move_epi32 or similar. And so I'm skeptical whether this is a good idea.

I did some searches, e.g., "can a movss instruction cause a floating point exception", but did not find any information that would answer my question.

Thanks in advance for knowledge you're willing to share.

-Noel

4
  • 1
    See also stackoverflow.com/questions/13153584/…. Not quite a duplicate because there's more to say about movss's weird design and difference from movd. Commented May 23, 2016 at 2:58
  • 2
    Intel really seems stuck in the assembly age. If the actual instruction shuffles bits without assigning meaning, the C intrinsics should have float and int versions. There's no reason why two intrinsics with different signatures couldn't map to the same instruction.
    – MSalters
    Commented May 23, 2016 at 7:40
  • I know, right? Kind of makes you want to write some new intrinsic inline functions to fill in the gaps.
    – NoelC
    Commented May 23, 2016 at 15:53
  • @MSalters: Intel finally did this for AVX with __m256 and __m256i intrinsics for vinsertf128. (vinserti128 is only in AVX2). Of course, there's not much you can usefully do with __m256i with only AVX1. But that's a great idea. They should absolutely introduce integer instrinsics for shufps, since there's nothing else like it for combining data from two registers until AVX512's vpermt2d (permute 2 vectors, overwriting the Table). Commented May 23, 2016 at 21:09

2 Answers 2

5

Yes, it's fine to use FP shuffles like movss xmm, xmm on integer data. The insn reference manual tells you that it can't raise FP numeric exceptions; only actual FP math instructions do that. So go ahead and cast.

There isn't even a bypass delay for using FP shuffles on integer data in most uarches (but there is extra latency for using integer shuffles between FP math instructions).

Agner Fog's "optimizing assembly" guide has a great section on what instructions are useful for different kinds of data movement (broadcasts, merging, etc.) See also the tag wiki for more good links.


The reason there's no integer intrinsic is that the SSE2 movd integer instruction zeros the upper bytes of the destination, like movss used as a load, but unlike movss between registers.

Intel's vector instruction set known for its inconsistency and non-orthogonality, esp. the earliest versions (like SSE1). SSE4.1 filled many gaps, but there are still obvious missing pieces.

2

The types __m128 and __m128i are interchangeable. The main reason for the cast is to make your intentions clearer (and keep your compiler happy). The cast itself does not generate any extra assembly.

The _mm_move_ss operation is described directly in terms of which bits end up in your result.

If you end up with an invalid bit combination for single-precision floats, this will only be a problem if you try to use the resulting value in floating-point calculations.

Not the answer you're looking for? Browse other questions tagged or ask your own question.