Hunter Praska's Blog

Reply to thread

hunter ###Admin 2018-11-29 20:19:47 No. 3

Is PUSH optimized on AMD64?

The AMD64 PUSH instruction writes a register to the stack and increments the stack pointer. It is semantically the same as

sub $8, %rsp
mov %rax, (%rsp)

So I got to wondering, is PUSH optimized, or is it just a macro? I decided to test it out. I benchmarked two procedures, the first using push/pop and the second using sub/add/mov.

# First procedure
push %rax
pop %rax

# Second procedure
sub $8, %rsp
mov %rax, (%rsp)
mov (%rsp), %rax
add $8, %rsp

There are probably a number of ways to test this; but since we're only testing a few instructions we need high precision. I decided to write a benchmark in Rust using the Criterion library as I've had a good experience with it. How do we benchmark a few instructions in Rust? Well you can use inline assembly on the nightly compiler but I'm not sure how stable it is. I decided to just JIT my instructions as I had experience with this from another project and knew it to be simple.

So prior to running the benchmarks, the program JITs the instructions and casts the allocated block to a fn() -> (). Thus the benchmarks will just be calling a procedure repeatedly.

These benchmarks will be run on an i5-2500k @ 4.0 GHz. The code can be found as a gist.


So what do we find? They're exactly the same! Using push we had a time of 1.8191 ns while using sub+mov we had a time of 1.8193 ns. I really didn't expect this to be the case, and it's such a short amount of time. Maybe there was some inaccuracy?

Let's try again, but this time we will repeat the instruction sequence for each procedure. Here are the times for different numbers of repetitions:

Repetitions Push (ns) Mov (ns)
1 1.8191 1.8193
2 3.7050 3.4527
4 6.6541 7.3608
8 13.183 14.283
16 25.557 29.037

This data seems to indicate that PUSH is indeed faster than using SUB+MOV.

"But wait!" you say. "You said you were going to benchmark PUSH vs SUB+MOV but this is a benchmark of PUSH+POP vs SUB+MOV+MOV+ADD!" Indeed, you are correct. So, I went back and used POP for both procedures. I won't post the data here, as it is the same as the Push column above. This seems to indicate that PUSH is indeed just an expansion of SUB+MOV.

Why then do the mov benchmarks become worse as the number of repetitions increases? My guess is that it comes down to code size. Each repetition for push is 2 bytes while each repetition for mov is 16 bytes. I think it's unlikely that this is a cache issue as the i5-2500k has a 64k L1 cache and 256k L2.

Are there any advantages to using PUSH over SUB+MOV then? Most definitely. PUSH is 1-2 bytes while SUB+MOV is 6-8 bytes. For such a common instruction, the reduction in code size is quite important as it allows more code to fit in cache.