A compiler I'm writing generates the following x86-64 assembly (AT&T syntax) for a recursive factorial function. I convert the assembly into an ELF executable using gcc. But when I execute it, the output is always some garbage number and not the desired output, 120.
Source Code:
fn factorial(num: i32) -> i32 {
    if num == 0 {
        return 1;
    } else {
        return num * factorial(num - 1);
    }
}
fn main() {
    println(factorial(5));
}
Generated Assembly:
.globl factorial, main
.format_number:
.string "%d\n"
factorial:
pushq %rbp
movq %rsp, %rbp
subq $4, %rsp # allocating 4 bytes for parameter num
movl %edi, -4(%rbp)
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
movl -4(%rbp), %ebx
movl $0, %r10d
cmpl %r10d, %ebx
jne .L0
movl $1, %ebx
movl %ebx, %eax
jmp .factorial_epilogue
jmp .L1
.L0:
movl -4(%rbp), %ebx
andq $-16, %rsp
movl -4(%rbp), %r10d
movl $1, %r11d
subl %r11d, %r10d
movl %r10d, %edi
pushq %r10
pushq %r11
call factorial
popq %r11
popq %r10
movl %eax, %r11d
imull %ebx, %r11d
movl %r11d, %eax
jmp .factorial_epilogue
.L1:
.factorial_epilogue:
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbx
movq %rbp, %rsp
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $0, %rsp
pushq %rbx
pushq %r12
pushq %r13
pushq %r14
pushq %r15
andq $-16, %rsp
movl $5, %ebx
movl %ebx, %edi
pushq %r10
pushq %r11
call factorial
popq %r11
popq %r10
movl %eax, %r11d
andq $-16, %rsp
movl %r11d, %esi
leaq .format_number(%rip), %rax
movq %rax, %rdi
xor %eax, %eax
call printf@PLT
.main_epilogue:
popq %r15
popq %r14
popq %r13
popq %r12
popq %rbx
movq %rbp, %rsp
popq %rbp
ret
.data
I thought this was an issue with stack alignment during function calls and tried to fix it by adding andq $-16, %rsp before every function call to align the stack pointer to a 16-byte boundary. Now memcheck does not detect any errors, but the output is still some garbage number. The same assembly worked fine when all the numbers were 8-byte values, and I don't understand why it doesn't work when the numbers are 4-byte values. Does anyone know how to find what's wrong in the generated assembly?
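A rough way to check the arithmetic is to track only the value of %rsp through the factorial prologue and see whether the andq actually moves it. The Python sketch below is not part of the compiler: the helper name `trace_rsp` and the entry address 0x1008 are made up for illustration, and the only ABI fact assumed is the System V rule that %rsp + 8 is a multiple of 16 at function entry.

```python
# Hypothetical helper: models only the value of %rsp, not memory contents.
def trace_rsp(local_bytes, saved_regs=5, entry_rsp=0x1008):
    """Return (rsp after the prologue pushes, rsp after `andq $-16, %rsp`)."""
    # System V x86-64: at function entry, rsp + 8 is a multiple of 16.
    assert (entry_rsp + 8) % 16 == 0
    rsp = entry_rsp
    rsp -= 8                 # pushq %rbp
    rsp -= local_bytes       # subq $local_bytes, %rsp
    rsp -= 8 * saved_regs    # pushq %rbx, %r12, %r13, %r14, %r15
    after_pushes = rsp
    rsp &= -16               # andq $-16, %rsp
    return after_pushes, rsp

for size in (4, 8):
    pushed, aligned = trace_rsp(size)
    print(f"{size}-byte local: pushes end at {pushed:#x}, "
          f"andq leaves {aligned:#x}, shift {pushed - aligned}")
```

Under these assumptions the shift is 4 bytes for a 4-byte local and 0 for an 8-byte local, so the andq is a no-op only in the 8-byte case. When the shift is non-zero, the push/pop pairs around the call return %rsp to the post-andq value, so the epilogue's popq instructions would read from addresses other than the ones the prologue's pushq instructions wrote to.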
Edit: Currently, the code generator attempts to align the stack by emitting an andq before every function call. The function prologue pushes 5 * 8 bytes (the callee-saved registers) and also allocates exactly the amount of space required for local variables. For example, it allocates 40 + 4 bytes if the function has only an i32 parameter.
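One possible alternative to realigning at each call site is to pad the local area once in the prologue so that locals plus saved registers come to a multiple of 16. This is only a sketch of that arithmetic, not how the code generator currently works: `padded_local_bytes` is a hypothetical name, and it assumes the prologue pushes %rbp first (leaving %rsp 16-byte aligned) and that anything pushed later before a call, such as %r10 and %r11, totals a multiple of 16 bytes.

```python
# Hypothetical helper for a code generator; not taken from the compiler above.
def padded_local_bytes(local_bytes, saved_regs=5):
    """Smallest local-area size >= local_bytes that keeps %rsp 16-byte aligned
    after the callee-saved registers have been pushed."""
    saved = 8 * saved_regs                     # bytes used by the pushq instructions
    total = (local_bytes + saved + 15) & ~15   # round the whole frame up to 16
    return total - saved                       # what the subq should allocate

for size in (0, 4, 8, 12):
    print(size, "->", padded_local_bytes(size))
```

With five saved registers this gives 8 bytes for a function with no locals or a single i32, and 24 bytes for 12 bytes of locals. Since locals are addressed relative to %rbp, the extra padding would sit between them and the saved registers without changing any offsets.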