
I am porting SSE SIMD code to use the 256-bit AVX extensions and cannot seem to find any instruction that will blend/shuffle/move the high 128 bits and the low 128 bits.

The backing story:

What I really want is VHADDPS/_mm256_hadd_ps to act like HADDPS/_mm_hadd_ps, only with 256-bit vectors. Unfortunately, it acts like two calls to HADDPS acting independently on the low and high 128-bit halves.


3 Answers


Using VPERM2F128, one can swap the low 128 and high 128 bits (as well as perform other permutations). The intrinsic usage looks like

x = _mm256_permute2f128_ps(x, x, 1);

The third argument is a control word that gives the user a lot of flexibility. See the Intel Intrinsics Guide for details.

  • The Intel reference manual specifies the control word for VPERM2F128. AVX2 also has VPERM2I128, which basically does the same thing - I don't know why Intel felt they needed 2 different instructions, since the type shouldn't make a difference, or should it? Commented Mar 7, 2020 at 19:47
  • This answers my question: Why both? vperm2f128 (avx) vs vperm2i128 (avx2) Commented Mar 7, 2020 at 19:59
  • The valignq instruction can also be used to do the equivalent of a 512-bit rotate in 64-bit increments (use valignd for 32-bit increments instead). Commented Nov 17, 2020 at 4:29
  • @AlexisWilke: That requires AVX-512. With just AVX2, you can use an immediate vpermq to swap halves of a single vector. vperm2f128 only requires AVX1 but is slower than vpermq on a few CPUs (e.g. Zen1 and KNL). Commented Nov 17, 2020 at 16:32
x = _mm256_permute4x64_epi64(x, 0b01'00'11'10);

See the Intel Intrinsics Guide entry for _mm256_permute4x64_epi64 for details.

Note: This instruction needs AVX2 (not just AVX1).

As @PeterCordes commented, speed-wise on Zen2 / Zen3 CPUs _mm256_permute2x128_si256(x, x, i) is the best option, even though it takes 3 arguments, compared to the 2-argument _mm256_permute4x64_epi64(x, i) suggested here.

On Zen1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), both choices are equal.

As already said, both _mm256_permute2x128_si256(x, y, i) and _mm256_permute4x64_epi64(x, i) need AVX2, while _mm256_permute2f128_si256(x, y, i) needs just AVX1.

  • This requires AVX2 not just AVX1, but yes it's faster on a few CPUs than VPERM2F128, and the same on others. (Including Zen1 surprisingly uops.info, and Knight's Landing where 2-input shuffles are slower). I don't think it's worse anywhere, except for CPUs with only AVX1 like Sandybridge and Piledriver that couldn't run it at all. Commented May 21, 2021 at 21:52
  • @PeterCordes Thanks for the comment! I'll add a note that it needs AVX2. I just thought that when the OP wrote that he needs an AVX instruction, he could mean any version of AVX; that is usually the case. Same as when somebody says they need an SSE solution, they mostly mean SSE2-SSE4.2. But yes, it is up to the OP to clarify what he actually needs. Still, my solution would be useful for some people; at least for me this question popped up in Google when I actually needed an AVX2 solution.
    – Arty
    Commented May 22, 2021 at 3:18
  • Yes, exactly, on Zen2 / Zen3 _mm256_permute2x128_si256(x, x, i) is the best option, repeating the same input twice. On Zen1 and KNL/KNM (and Bulldozer-family Excavator), _mm256_permute4x64_epi64(x, i) is more efficient. On other CPUs (including mainstream Intel), both choices are equal. AVX1 CPUs don't have a choice, only vperm2f128 is available. Even vpermpd is AVX2. Commented Jun 2, 2021 at 3:44
  • vperm2f128 (AVX1) and vperm2i128 (AVX2) run the same on every AVX2 CPU. I don't think there's extra bypass latency on any real CPUs for using the f128 version between AVX2 integer instructions, but it's probably a good idea to use the i128 version - it shouldn't ever be worse than vperm2f128, although it can be worse than vpermq depending on the CPU. Commented Jun 2, 2021 at 3:46
  • both run at same speed everywhere - that's something I'm not 100% sure about. It's possible some CPUs could have extra latency if you use vperm2f128 between vpaddb ymm, ymm instructions for example. So if you're using other __m256i intrinsics that also require AVX2, use _mm256_permute2x128_si256 or _mm256_permute4x64_epi64. If you're using __m256 or __m256d in a function that only requires AVX1 (and maybe FMA), it's not worth making a separate AVX2 version just for vpermpd, unless you want to tune for Zen1 specifically (taking into account its 128-bit vector hardware). Commented Jun 2, 2021 at 4:09

The only way that I know of doing this is with _mm256_extractf128_si256 and _mm256_set_m128i. E.g. to swap the two halves of a 256-bit vector:

__m128i v0l = _mm256_extractf128_si256(v0, 0); // low 128 bits
__m128i v0h = _mm256_extractf128_si256(v0, 1); // high 128 bits
__m256i v1  = _mm256_set_m128i(v0l, v0h);      // set_m128i(hi, lo): halves swapped
  • Do you know the difference between _mm256_extractf128_si256 and _mm256_extracti128_si256? The only thing I can tell is that the first one works with AVX and the second requires AVX2. Why would anyone ever use the second version? I looked at Agner Fog's instruction tables and latency, throughput, and ports are identical. Maybe I should ask this as a question.
    – Z boson
    Commented Sep 5, 2014 at 8:40
  • I thought I'd already seen this asked somewhere on SO but a quick search didn't turn it up - AFAIK they are effectively the same.
    – Paul R
    Commented Sep 5, 2014 at 9:32
  • @Zboson: oops - just found the question I mentioned above - I should have searched for the instructions rather than the intrinsics: stackoverflow.com/questions/18996827/…
    – Paul R
    Commented Sep 5, 2014 at 11:06
  • I believe this way is slower than Mark's answer, since extractf and set each have latency 3, throughput 1.
    – mafu
    Commented Apr 26, 2017 at 3:13
  • @mafu: yes, true - note also that clang (and perhaps other compilers) is smart enough to convert the above into a single vperm2f128, making it essentially the same as Mark's answer.
    – Paul R
    Commented Apr 26, 2017 at 6:05
