7
\$\begingroup\$

Following exercise:

Write a program that takes a number (of size 4 bytes) x as input, and then reverses all the bits of x, and outputs the result. By reversing all bits we mean that the bit with original location i will move to location 31-i. Small example (for the 8 bit case): if x == {01001111}_2, then the output is {11110010}_2. In this example we reversed only 8 bits. Your program will be able to reverse 32 bits.

Full exercise-description can be seen here: XORPD GitHub

I tinkered out the following idea. Please take into account my comments too.

format PE console
entry start

include 'win32a.inc' 

; ===============================================
section '.text' code readable executable

start:
    mov     eax,    0x4f    ; 0x4f is equal to 01001111 (from the exercise-description example).
    mov     cl,     0x1f    ; cl becomes the control variable. 0x1f == 31 decimal
    xor     edx,    edx     ; edx will accumulate the different states of ebx during the runtime.
process_bit:
    shr     eax,    0x1     ; Kick the right-most bit out ...
    jc      add_one         ; If it was a 1 jump to 'add_one' ... 
    mov     ebx,    0x0     ;  ... otherwise write a 0 ...
    jmp     now_rotate      
add_one:
    mov     ebx,    0x1
now_rotate:
    rol     ebx,    cl      ; The right-most bit has been fresh written. Now move it n-positions to the left.
    or      edx,    ebx     ; "Save" or "Add" the current positive bits (1-bits) of ebx to edx.
    loop    process_bit

    mov     eax,    edx
    call    print_eax_binary

    ; Exit the process:
    push    0
    call    [ExitProcess]


include 'training.inc'

I guess it works right.

screenshot program-result

Please compare the result on the screenshot with the example-value from the exercise-description.

What to think about my solution?

Is it valid? Or does it have to be improved?

Is there a better way to solve the described task?

Looking forward to read your hints and comments.

\$\endgroup\$
2
  • 1
    \$\begingroup\$ I really hope that the ch register is initially zero. Does the Windows ABI guarantee this? \$\endgroup\$ Commented Jun 26, 2017 at 20:17
  • \$\begingroup\$ @RolandIllig Not sure concerning ch indeed. Thanks for the hint. \$\endgroup\$ Commented Jun 27, 2017 at 2:02

5 Answers 5

4
\$\begingroup\$

In assembler it is often possible to avoid jmp instructions. This is done to improve branch prediction and thereby performance.

Your code can be written without the jump to add_one:

xor ebx, ebx
shr eax, 1
adc ebx, 0

These three instructions replace the whole jumping. The adc (add with carry) instruction's original purpose is to support chained addition, but it can be used creatively for many other purposes. Note that I had to move the xor ebx, ebx to the top since it updates the carry flag.

There are also the rcr and rcl instructions that efficiently use the carry flag as a one-bit register. You can just repeat 32 times:

rcr eax, 1
rcl ebx, 1

And you're done. When you inline this loop, you have a constant-time operation that finishes in 64 instructions. Each of them probably takes a single cycle, therefore 64 cycles. That's already acceptable, but there are probably better ways.

On the electrical, hardware level, the bitswap operation can be implemented by just swapping the wires, therefore chances are high that there is some machine instruction that makes this task more efficient. Just read through the whole processor manual to see if there is anything related. Have a look for the keywords "swap", "shift", "mask".

If you want to get a really fast program, you should use the bswap eax instruction, followed by code that reverses the bit order. This can probably be found in the excellent book Hacker's Delight.

The basic idea is to take every second bit and shift it to the left. At the same time, take the remaining bits and shift them to the right. Like this:

bits0 = (x & 0x55555555) << 1
bits1 = (x >> 1) & 0x55555555
x = bits0 | bits1

Then do the same thing with groups of 2 bits, and then once more with groups of 4 bits. The groups of 8 and 16 bits are done by the bswap instruction, you don't need to implement them on your own.

\$\endgroup\$
2
  • \$\begingroup\$ shr eax, 1 like in the original would work instead of rcr eax, 1 -- though the resulting value in eax would be discarded anyway so it doesn't matter much. \$\endgroup\$
    – ecm
    Commented Dec 10, 2019 at 15:38
  • \$\begingroup\$ @ecm Thanks for the edit and for suggesting shr instead of rcr. Indeed, when I looked up the latency and throughput for the rcr reg, 1 instruction, I was surprised that it is not always 1 cycle and 1 latency, even though this instruction can be implemented by just routing some wires. \$\endgroup\$ Commented Dec 10, 2019 at 15:41
3
\$\begingroup\$

Generally looks correct. A certain improvement would be to use a rotate-with-carry instruction:

    process_bit:
        shr eax, 0x01
        rcl ebx, 0x01
        loop process_bit
\$\endgroup\$
3
\$\begingroup\$

Two things caught my eye.

1) Looking at this:

process_bit:
    shr     eax,    0x1     ; Kick the right-most bit out ...
    jc      add_one         ; If it was a 1 jump to 'add_one' ... 
    mov     ebx,    0x0     ;  ... otherwise write a 0 ...
    jmp     now_rotate      
add_one:
    mov     ebx,    0x1
now_rotate:
    rol     ebx,    cl      ; The right-most bit has been fresh written. Now move it n-positions to the left.

When you want to zero a register, normally you would do that with xor ebx, ebx instead of mov ebx, 0x0 since this is slightly smaller/faster. Also, what about something like this:

process_bit:
    xor     ebx,    ebx     ; Start at zero
    shr     eax,    0x1     ; Kick the right-most bit out ...
    adc     ebx,    0x0     ; Add the shifted bit to ebx
    rol     ebx,    cl      ; The right-most bit has been fresh written. Now move it n-positions to the left.

2) Then there's this:

mov     eax,    edx
call    print_eax_binary

If you know that you are going to need a specific value in eax, why not structure your code such that it ends up there in the first place? If appears that if you swap the usage of eax and edx, you can avoid this move.

PS Regarding Roland's somewhat cryptic remark regarding ch:

If ch isn't zero (for example if it is 7), when you do mov cl, 0x1f, then the ecx register will contain 0x71f. This becomes a problem, since loop uses ecx, not cl.

\$\endgroup\$
3
\$\begingroup\$

Here's the 16-bit 8086 source to my bit mirror function, inputs are bx:dx for the size (question wants 32 there) and dword [hhvar] for the data to be bit-mirrored, output is in bx:dx. Like vnp also suggested I use rcl. I shift right bx:dx and then rotate-with-carry-left into si:di.

of_bit_mirror:
        mov cx, dx
        test bx, bx
        jnz .large
        cmp dx, byte 64
        jb .normal
.large:
        xor bx, bx              ; mirror count 64 or higher:
        xor dx, dx              ;  all 32 bits mirrored with (nonexistent) zero bits
        retn
.normal:
        mov dx, word [hhvar]
        mov bx, word [hhvar+2]
        cmp cl, 1
        jbe .ret                ; mirror count one or zero, return input -->
        push si
        push di

        push cx
        mov di, -1
        mov si, di
.loopmask:
        shl di, 1
        rcl si, 1
        loop .loopmask          ; create mask of bits not involved in mirroring
        and si, bx
        and di, dx              ; get the uninvolved bits
        pop cx

        push si
        push di                 ; save them
        xor si, si
        xor di, di              ; initialize mirrored register
.loop:
        shr bx, 1
        rcr dx, 1               ; shift out of original register's current LSB
        rcl di, 1
        rcl si, 1               ;  into other register's current LSB
        loop .loop
        pop dx
        pop bx                  ; restore uninvolved bits
        or bx, si
        or dx, di               ; combine with mirrored bits

        pop di
        pop si
.ret:
        retn

In relation to your exercise, this solution allows dynamically selecting the width of the bit mirror operation, so that the 8-bit example "if x == {01001111}_2, then the output is {11110010}_2" can be run by inputting 8 in bx:dx and 0100_1111b = 4Fh in the variable.

I also decided to mask out the uninvolved (higher) bits and return them unchanged.

Both of my loops only have one branch for the loop itself each, whereas your loop's code branches quite a lot, including a conditional branch. Conditional branch prediction is a notoriously difficult part of the processor, so less branches are better. (Other answers also eliminated the conditional branch in several ways.)

Aside the conditional branch, you're also using an unconditional branch to now_rotate. This does not have the negative performance impact of a conditional branch, but it does make your solution be spaghetti code. The amount of labels you need (add_one) is another sign of spaghetti code. Here's how to reduce the spaghetti factor (while keeping the conditional branch):

        shr eax, 0x1    ; Kick the right-most bit out ...
        mov ebx, 0x0    ; Initialise register to a 0
        jnc now_rotate  ; If eax bit was a 0 jump to now_rotate
        mov ebx, 0x1    ; Reset register to a 1
now_rotate:

This streamlines the control flow. However, if ebx is initialised to zero then there is an easier way to get it to one. inc ebx is (in 32-bit code segments) a one-byte opcode:

        shr eax, 0x1    ; Kick the right-most bit out ...
        mov ebx, 0x0    ; Initialise register to a 0
        jnc now_rotate  ; If eax bit was a 0 jump to now_rotate
        inc ebx         ; Increment register to a 1
now_rotate:

And it is yet easier to initialise to zero by using "xor with itself", which has a shorter encoding than mov ebx, imm32, and is recognised as a zeroing idiom on modern processors. This clobbers the flags register, which means we cannot put it between the shift and the conditional jump. But we can swap the shift and zeroing to have the zero initialisation first:

        xor ebx, ebx    ; Initialise register to a 0
        shr eax, 0x1    ; Kick the right-most bit out ...
        jnc now_rotate  ; If eax bit was a 0 jump to now_rotate
        inc ebx         ; Increment register to a 1
now_rotate:

Below your label now_rotate you shift the single bit into a mask to OR into the destination register. I shift the destination register by one place in each iteration instead. Your way uses an additional register -- this is of course more of an issue for my 16-bit implementation that runs out of registers sooner. The additional shifting of the single bit into a mask, needing an up-to-31-places shift, and applying that mask to the destination would also be much more complex with 16-bit registers than my solution's double-width one-bit shift.

Your loop also shifts the source register to the right one place in each iteration, shifting a bit of interest into the Carry Flag. This is the same as in my solution. It would also be valid to shift to the left and shift the MSB into the Carry Flag, then rotate-with-carry-right that into the MSB of the destination register. The way we chose is easier to handle for arbitrary width bit mirroring though.

\$\endgroup\$
0
2
\$\begingroup\$

As the exercise doesn't require result to be stored, each digit can be displayed immediately.

    mov     ecx, eax      ; Copy dword to be displayed in reverse
@@: mov      al, cl       ; Move low nibble to AL
    and      al, 1        ; Isolate least significant bit
    or       al, '0'      ; Going to be either 0 or 1

   .....  Whatever is required here to display contents of AL

    shr     ecx, 1
    jnz     @b

The only difference with this algorithm is trailing zero's won't be displayed. Just a personal preference of mine, otherwise you could substitute ECX to be a counter of 32 and EDX the copy of EAX if the full 32 bits needs to be shown.

\$\endgroup\$

Not the answer you're looking for? Browse other questions tagged or ask your own question.