
Questions tagged [sse]

SSE (Streaming SIMD Extensions) was the first of many similarly named vector extensions to the x86 instruction set. At this point, SSE is more often used as a catch-all for x86 vector instructions in general than as a reference to SSE without SSE2, SSE3, etc. (For Server-Sent Events, use the [server-sent-events] tag instead.)

0 votes
1 answer
21 views

Why does CSAPP say GCC does not use vcvtss2sd?

Computer Systems: A Programmer's Perspective (3rd ed.), in section 3.11.1, says "Suppose the low-order 4 bytes of %xmm0 hold a single-precision value; then it would seem straightforward to use the ...
TouXianGuan's user avatar
3 votes
1 answer
85 views

Twice as slow SIMD performance without extra copy

I've been optimizing some code and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...
Alex's user avatar
  • 586
0 votes
0 answers
16 views

How to implement real-time responses in a Flask-based chatbot with OpenAI Assistants API?

I have a basic chatbot that currently waits for the backend to fully process and generate a response before displaying it to the user. During this wait, the user sees a "Typing..." message. ...
Josh's user avatar
  • 1
1 vote
0 answers
95 views

Speed-up byte signature scanning in memory using SIMD

I'm searching for various byte patterns in big memory chunks using this code: BOOLEAN Find(const unsigned char* data, SIZE_T data_size, const unsigned char* to_find, SIZE_T to_find_size, SIZE_T* index)...
Kracken's user avatar
  • 682
8 votes
0 answers
150 views

Why does removing instructions from my SSE intrinsic function make it slower?

Please note that this question is not about YUV422 to RGB conversion! I have this code for a pixel order YUV422 to RGB conversion. static void yuv422ToRGB(unsigned char* img, int width, int height, ...
Crigges's user avatar
  • 1,233
0 votes
0 answers
69 views

GCC generates slow code when targeting more recent sse version

I have a very simple test program, shown below, that just sums all uint8 values in an array. GCC seems to generate significantly slower code when targeting sse4 or avx2. The code is significantly faster with ssse3. Is ...
AdamF's user avatar
  • 2,561
1 vote
0 answers
97 views

How to compile with RAD Studio C++ Builder 12 BCC64 using AVX, SSE, F16C extensions?

I'm compiling some plain C code under C++ Builder 12 for x64 (the compiler called from the IDE is BCC64.EXE), and when I enable some macros in third-party headers related to CPU extensions like ...
rafastar's user avatar
1 vote
0 answers
84 views

save xmm registers in windows kernel

I am working on a Windows kernel-mode driver and need to perform floating-point operations using the xmm registers (xmm0, xmm1, and xmm2). To avoid interfering with the kernel's or other drivers' state, ...
daniel's user avatar
  • 45
3 votes
2 answers
136 views

Zero remaining Bytes after first Zero in SSE Register

For this question, I will use the notation 1 for a byte with all ones (0xFF) and 0 for a byte with all zeros. I am looking for a way to zero the remaining bytes in an SSE register after the first zero ...
Crigges's user avatar
  • 1,233
2 votes
0 answers
56 views

Custom kernel: Stack unaligned, fault on compiler-generated SSE movaps [duplicate]

I'm seeing a weird problem with my kernel where XMM instructions fault because the 16-byte alignment constraint on RSP is unmet. The function frame starts with an aligned value, but as it makes space for the buffer,...
Tretorn's user avatar
  • 397
1 vote
0 answers
22 views

How to identify the proportion of frequency reduction of a process caused by AVX instructions?

Different types of AVX instructions can cause a decrease in CPU frequency[1]. The proportion of this decrease can be evaluated through the PMU events called `CORE_POWER.LVL0/1/2_TURBO_LICENSE`. However, ...
Frontier_Setter's user avatar
0 votes
0 answers
85 views

compiler generated assembler

A question about compiler-generated assembler: my to-be-optimized main loop includes two memory accesses instead of register accesses. loop: mov xmm, mem // pre-calculated value pushed on the stack pxor xmm, ...
linuxCowboy's user avatar
2 votes
1 answer
88 views

Is there anything more I need to do before using SSE instructions?

I attempted to use an SSE instruction after I set bit 18 (OSXSAVE) of the CR4 register and executed xsetbv, but it does not work. The CPU raised interrupt 0x6 (#UD). Is it because I didn't do ...
sanzenyou's user avatar
0 votes
1 answer
62 views

Set Last Value in __m128 vector register

So I have a set of data with mixed values for packing purposes that goes like this: {(Point_x, Point_y, Point_z, Scalar), (Point_x, Point_y, Point_z, Scalar), (Point_x, Point_y, Point_z, Scalar), ......
yosmo78's user avatar
  • 591
0 votes
0 answers
33 views

Vector by Scalar Division with -ffast-math

typedef float float4 __attribute__((vector_size(16))); float4 divvs(float4 vector, float scalar) { return vector / scalar; } compiles to // x86 gcc/clang -O3 shufps xmm1, xmm1, 0 divps ...
bockyboh's user avatar
