I am porting SSE SIMD code to the 256-bit AVX extensions and cannot find any instruction that will blend, shuffle, or move data between the high 128 bits and the low 128 bits.
The backstory:
What I really want is VHADDPS/_mm256_hadd_ps to act like HADDPS/_mm_hadd_ps, only with 256-bit words. Unfortunately, it acts like two calls to HADDPS acting independently on the low and high words.
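To make the lane behavior described above concrete, here is a minimal sketch that runs both instructions and stores the results so the element order can be inspected. The helper names `hadd128` and `hadd256` and the `TARGET_AVX` macro are made up for this illustration; the macro assumes GCC or Clang, where it lets the functions be compiled without passing `-mavx` globally.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. The target attribute enables AVX (and SSE3)
   for these functions only; other compilers get an empty macro. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* HADDPS on 128-bit vectors:
   out = { a0+a1, a2+a3, b0+b1, b2+b3 } */
TARGET_AVX
void hadd128(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_hadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* VHADDPS on 256-bit vectors works on each 128-bit lane separately:
   out = { a0+a1, a2+a3, b0+b1, b2+b3,     (low lane)
           a4+a5, a6+a7, b4+b5, b6+b7 }    (high lane)
   and not { a0+a1, a2+a3, a4+a5, a6+a7, b0+b1, ... } as a straight
   widening of HADDPS would give. */
TARGET_AVX
void hadd256(const float a[8], const float b[8], float out[8])
{
    _mm256_storeu_ps(out,
        _mm256_hadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b)));
}
```

Calling `hadd256` with a = {1,...,8} and b = {10,...,80} puts the sums from b in elements 2, 3, 6 and 7, interleaved with the sums from a, which is exactly the "two independent HADDPS" behavior.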
vextractf128 is fast everywhere (especially Zen1) and narrows to 128-bit vectors; see "How to sum __m256 horizontally?". But you wouldn't want haddps as part of an efficient horizontal sum in the first place, so hopefully that wasn't what you were doing... Unless you had multiple hsums to do, then yes, vhaddps can be useful, like in "Intel AVX: 256-bits version of dot product for double precision floating point variables". And maybe 2x vperm2f128 + vaddps
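A rough sketch of the two cases mentioned in this comment, for illustration only: a single horizontal sum that narrows with vextractf128 and never uses haddps, and a batch of four horizontal sums where vhaddps does earn its keep. The function names `hsum256_ps` and `hsum4x256_ps` and the `TARGET_AVX` macro are hypothetical, and the macro assumes GCC or Clang.

```c
#include <immintrin.h>

/* Assumption: GCC/Clang. Enables AVX per function instead of -mavx. */
#if defined(__GNUC__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Sum the 8 floats at p: narrow to 128 bits first, then shuffle/add. */
TARGET_AVX
float hsum256_ps(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);    /* low half, no instruction */
    __m128 hi = _mm256_extractf128_ps(v, 1);  /* vextractf128 */
    __m128 s  = _mm_add_ps(lo, hi);           /* 4 partial sums */
    __m128 sh = _mm_movehl_ps(s, s);          /* { s2, s3, s2, s3 } */
    s  = _mm_add_ps(s, sh);                   /* s0+s2 and s1+s3 in elems 0, 1 */
    sh = _mm_shuffle_ps(s, s, 0x55);          /* broadcast element 1 */
    s  = _mm_add_ss(s, sh);                   /* total in element 0 */
    return _mm_cvtss_f32(s);
}

/* Four horizontal sums at once: out = { sum(a), sum(b), sum(c), sum(d) }.
   Two rounds of in-lane vhaddps leave per-lane partial sums, and one
   vextractf128 + add combines the lanes. */
TARGET_AVX
void hsum4x256_ps(const float *a, const float *b,
                  const float *c, const float *d, float out[4])
{
    __m256 ab = _mm256_hadd_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
    __m256 cd = _mm256_hadd_ps(_mm256_loadu_ps(c), _mm256_loadu_ps(d));
    /* low lane:  { a0..3, b0..3, c0..3, d0..3 } partial sums
       high lane: { a4..7, b4..7, c4..7, d4..7 } partial sums */
    __m256 t = _mm256_hadd_ps(ab, cd);
    __m128 r = _mm_add_ps(_mm256_castps256_ps128(t),
                          _mm256_extractf128_ps(t, 1));
    _mm_storeu_ps(out, r);
}
```

With eight input vectors, the same idea would produce two such `t` vectors, and the lanes could be combined with 2x vperm2f128 + vaddps to get all eight sums in one 256-bit register.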