
I am porting SSE SIMD code to use the 256-bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.

The backing story:

What I really want is VHADDPS/_mm256_hadd_ps to act like HADDPS/_mm_hadd_ps, only with 256-bit vectors. Unfortunately, it acts like two calls to HADDPS acting independently on the low and high 128-bit halves.


3 Answers


Using VPERM2F128, one can swap the low 128 and high 128 bits (as well as perform other permutations). The intrinsic usage looks like

x = _mm256_permute2f128_ps(x, x, 1);

The third argument is a control word that gives the user a lot of flexibility. See the Intel Intrinsics Guide for details.

  • The Intel reference manual specifies the control word for VPERM2F128. AVX2 also has VPERM2I128, which basically does the same thing - I don't know why Intel felt they needed 2 different instructions, since the type shouldn't make a difference, or should it? Commented Mar 7, 2020 at 19:47
  • This answers my question: Why both? vperm2f128 (avx) vs vperm2i128 (avx2) Commented Mar 7, 2020 at 19:59
  • The valignq instruction can also be used to do the equivalent of a 512-bit rotate in 64-bit increments (use valignd for 32-bit increments instead). Commented Nov 17, 2020 at 4:29
  • @AlexisWilke: That requires AVX-512. With just AVX2, you can use an immediate vpermq to swap halves of a single vector. vperm2f128 only requires AVX1 but is slower than vpermq on a few CPUs (e.g. Zen1 and KNL). Commented Nov 17, 2020 at 16:32
x = _mm256_permute4x64_epi64(x, 0b01'00'11'10);

See the Intel Intrinsics Guide entry for _mm256_permute4x64_epi64 for details.

Note: This instruction needs AVX2 (not just AVX1).

As @PeterCordes commented, speed-wise on Zen2 / Zen3 CPUs _mm256_permute2x128_si256(x, x, i) is the best option, even though it takes 3 arguments, compared to the 2-argument _mm256_permute4x64_epi64(x, i) suggested here.

On Zen1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), both choices are equal.

As already said, both _mm256_permute2x128_si256(x, y, i) and _mm256_permute4x64_epi64(x, i) need AVX2, while _mm256_permute2f128_si256(x, y, i) needs just AVX1.

  • This requires AVX2 not just AVX1, but yes it's faster on a few CPUs than VPERM2F128, and the same on others. (Including Zen1 surprisingly uops.info, and Knight's Landing where 2-input shuffles are slower). I don't think it's worse anywhere, except for CPUs with only AVX1 like Sandybridge and Piledriver that couldn't run it at all. Commented May 21, 2021 at 21:52
  • @PeterCordes Thanks for the comment! I'll add a note that it needs AVX2. I just thought that when the OP wrote that he needs an AVX instruction, he could mean any version of AVX; that is usually the case. Same as when somebody says they need an SSE solution, they mostly mean SSE2-SSE4.2. But yes, it is up to the OP to clarify what he actually needs. Still, my solution would be useful for some people; at least for me this question popped up in Google when I actually needed an AVX2 solution.
    – Arty
    Commented May 22, 2021 at 3:18
  • Yes, exactly, on Zen2 / Zen3 _mm256_permute2x128_si256(x, x, i) is the best option, repeating the same input twice. On Zen1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), both choices are equal. AVX1 CPUs don't have a choice, only vperm2f128 is available. Even vpermpd is AVX2. Commented Jun 2, 2021 at 3:44
  • vperm2f128 (AVX1) and vperm2i128 (AVX2) run the same on every AVX2 CPU. I don't think there's extra bypass latency on any real CPUs for using the f128 version between AVX2 integer instructions, but it's probably a good idea to use the i128 version - it shouldn't ever be worse than vperm2f128, although it can be worse than vpermq depending on the CPU. Commented Jun 2, 2021 at 3:46
  • both run at same speed everywhere - that's something I'm not 100% sure about. It's possible some CPUs could have extra latency if you use vperm2f128 between vpaddb ymm, ymm instructions for example. So if you're using other __m256i intrinsics that also require AVX2, use _mm256_permute2x128_si256 or _mm256_permute4x64_epi64. If you're using __m256 or __m256d in a function that only requires AVX1 (and maybe FMA), it's not worth making a separate AVX2 version just for vpermpd, unless you want to tune for Zen1 specifically (taking into account its 128-bit vector hardware). Commented Jun 2, 2021 at 4:09

The only way that I know of doing this is with _mm256_extractf128_si256 and _mm256_set_m128i. E.g. to swap the two halves of a 256-bit vector:

__m128i v0l = _mm256_extractf128_si256(v0, 0); // low 128 bits
__m128i v0h = _mm256_extractf128_si256(v0, 1); // high 128 bits
__m256i v1  = _mm256_set_m128i(v0l, v0h);      // set_m128i(hi, lo): halves swapped
  • Do you know the difference between _mm256_extractf128_si256 and _mm256_extracti128_si256? The only thing I can tell is that the first one works with AVX and the second requires AVX2. Why would anyone ever use the second version? I looked at Agner Fog's instruction tables and latency, throughput, and ports are identical. Maybe I should ask this as a question.
    – Z boson
    Commented Sep 5, 2014 at 8:40
  • I thought I'd already seen this asked somewhere on SO but a quick search didn't turn it up - AFAIK they are effectively the same.
    – Paul R
    Commented Sep 5, 2014 at 9:32
  • @Zboson: oops - just found the question I mentioned above - I should have searched for the instructions rather than the intrinsics: stackoverflow.com/questions/18996827/…
    – Paul R
    Commented Sep 5, 2014 at 11:06
  • I believe this way is slower than Mark's answer, since extractf and set each have latency 3, throughput 1.
    – mafu
    Commented Apr 26, 2017 at 3:13
  • @mafu: yes, true - note also that clang (and perhaps other compilers) is smart enough to convert the above into a single vperm2f128, making it essentially the same as Mark's answer.
    – Paul R
    Commented Apr 26, 2017 at 6:05
