The Windows x64 ABI (Application Binary Interface) presents some new challenges for assembly programming that don’t exist for x86. A couple of the changes that must be taken into account can be seen as very positive. First of all, there is now one and only one OS-specified calling convention. We certainly could have devised our own register-based calling convention, as we did on x86, but since the system calling convention is already register-based, that would have been an unnecessary complication. The other significant change is that the stack must always remain aligned on 16-byte boundaries. This seems a little onerous at first, but I’ll explain how it works and why it’s necessary, along with how it can actually make calling other functions from assembly code more efficient and sometimes even faster than on x86. For a detailed description of the calling convention, register usage and reservations, etc., please see the x64 calling convention documentation. I’ll also discuss exceptions and why all of this is necessary.
For any given function there are three parts we’re going to talk about:
the prolog, body, and epilog. The prolog and epilog contain all the
setup and tear-down of the function’s “frame”. The prolog is where all
the space on the stack is reserved for local variables and, unlike
how the x86 compiler works, for the largest parameter area needed by
any of the function calls within the body. The epilog does the reverse
and releases the reserved stack space just prior to returning to the
caller. The body of a function is where the user’s code is placed,
either compiled Pascal or, as we’ll see, the assembler code you write.
You may be wondering why the prolog is reserving parameter space in
addition to the space needed for local variables. Why not just push the
parameters on the stack right before calling a function? While there is
technically nothing keeping the compiler from placing parameters for a
function call on the stack immediately before a call, doing so would
make the exception tables larger. As I mentioned above,
exceptions in x64 are not implemented the same as in x86, which was a
stack-based linked list of records. In x64, exceptions are done using
extra data generated by the compiler that describes the stack changes
for a given function and where the handlers/finally blocks are located.
By only modifying the stack within the prolog and epilog, “unwinding”
the stack is easier and more accurate. Another side benefit is that when
passing stack parameters to functions, the space is already available
so the data merely needs to be “MOV”ed onto the stack without the need
for a PUSH. The stack also remains properly aligned, so no extra
finagling of the RSP register is necessary.
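The arithmetic behind that reservation can be sketched in a few lines. This is a simplified model written in Python, not Delphi code and not what the compiler literally emits. The function name frame_bytes and its parameters are hypothetical. It assumes 8-byte parameter slots, a minimum of four slots whenever any call is made (the home space the Windows x64 convention requires for the register parameters), and a prolog that pushes RBP before reserving space.

```python
SLOT = 8       # every parameter slot is 8 bytes on x64
MIN_SLOTS = 4  # home space for the four register parameters
ALIGN = 16     # RSP must sit on a 16-byte boundary at every CALL


def frame_bytes(local_bytes, max_call_params=None):
    """Bytes a prolog would subtract from RSP after PUSH RBP (simplified model).

    max_call_params is the largest parameter count of any call in the body,
    or None for a leaf function that makes no calls at all.
    """
    if max_call_params is None:
        param_bytes = 0
    else:
        # Even a call with fewer than four parameters needs the full home space.
        param_bytes = max(max_call_params, MIN_SLOTS) * SLOT
    needed = local_bytes + param_bytes
    # On entry RSP is 8 bytes past a 16-byte boundary because the return
    # address was just pushed. PUSH RBP restores the alignment, so the
    # reservation itself has to be a multiple of 16 to keep it.
    return (needed + ALIGN - 1) // ALIGN * ALIGN


if __name__ == "__main__":
    print(frame_bytes(8, 6))
```

Under this model, a function with 8 bytes of locals whose largest call takes six parameters needs 48 bytes of parameter area plus 8 bytes of locals, which rounds up to a 64-byte reservation. Once that space exists, every call in the body can store its stack parameters with plain MOVs.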
Directives
Delphi for 64-bit Windows introduced several new assembler directives or “pseudo-instructions”: .NOFRAME, .PARAMS, .PUSHNV, and .SAVENV. These directives allow you to control how the compiler sets up the context frame and ensure that the proper exception table information is generated.
.NOFRAME
Some functions never make calls to other functions. These are called “leaf” functions because they don’t do any further “branching” out to other functions, so, as in a tree, they represent a “leaf”. For functions such as this, having a full stack frame may be extra overhead you want to eliminate. While the compiler does try to eliminate the stack frame if it can, there are times when it simply cannot figure this out automatically. If you are certain a frame is unnecessary, you can use this directive as a hint to the compiler.
.PARAMS <max params>
This one may be a little confusing because it does not refer to
the parameters passed into the current function. Rather, this directive
should be placed near the top of the function (preferably before any
actual CPU instructions) with a single ordinal parameter that tells the
compiler the maximum number of parameters needed by any of the
function calls within the body. This allows the compiler to reserve
extra, properly aligned stack space for passing parameters to other
functions. This number should be the largest parameter count among all
the functions called, and should include even those parameters that are
passed in registers. If you’re going to call a function that takes 6
parameters, then you should use “.PARAMS 6”.
When you use the .PARAMS directive, a pseudo-variable @Params becomes
available to simplify passing parameters to other functions. It’s fairly
easy to load up a few registers and make a call, but the x64 calling
convention also requires that callers reserve space on the stack even
for register parameters. The .PARAMS directive ensures this is the case,
so you should still use the .PARAMS directive even if you’re going to
call a function in which all parameters are passed in registers. You use
the @Params pseudo-variable as an array of bytes, which allows you to
address every byte of the parameter area. Each parameter takes up 8
bytes (64 bits), so you’ll need to scale the parameter index by 8 to
reach a given parameter, with the first parameter at offset 0. You
generally don’t use the first 4 slots, since those parameters must be
passed in registers, so you’ll start at parameter index 4, which is
@Params[4*8]. Because the elements are bytes, a cast or size override
is necessary, such as “DWORD PTR @Params[4*8]” or “@Params[4*8].Byte”.
Using the @Params pseudo-variable saves the programmer from having to
manually calculate the offsets based on alignments and local variables.
UPDATE: An earlier version of this post described @Params as an array
of 64-bit elements and used “DWORD PTR @Params[4]” as the example. I
foobar’ed that one. Sorry about that.
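The scaling rule is easy to get wrong, so here is the same arithmetic spelled out as a small Python model. It is not Delphi code. The helper names max_params, params_offset, and params_operand are hypothetical and exist only for illustration. The one fact it relies on is the rule described here, that each parameter occupies 8 bytes of the byte-addressed @Params area.

```python
SLOT = 8  # each parameter occupies 8 bytes, whatever its declared size


def max_params(call_arities):
    """Value to give .PARAMS: the largest parameter count among the calls made.

    Register parameters are counted too, so a six-parameter call counts as 6.
    """
    return max(call_arities)


def params_offset(index):
    """Byte offset to use inside @Params[...] for zero-based parameter index."""
    return index * SLOT


def params_operand(index, size="QWORD"):
    """Format an operand with an explicit size override, e.g. for a MOV."""
    return f"{size} PTR @Params[{params_offset(index)}]"


if __name__ == "__main__":
    # A body that calls functions taking 2, 6, and 3 parameters.
    print(".PARAMS", max_params([2, 6, 3]))
    # The first stack-passed parameter is index 4.
    print(params_operand(4, "DWORD"))
```

Here the fifth parameter (index 4) lands at byte offset 32, which is the same location as @Params[4*8], and the sixth at offset 40.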
.PUSHNV <GPReg>, .SAVENV <XMMReg>
According to the x64 calling convention and register usage spec, there
are some registers which are considered non-volatile. This means that
these registers are guaranteed to have the same value after a function
call as they had before the function call. This doesn’t mean such a
register is not available for use; it just means the called function
must ensure it is properly preserved and restored. The best place to
preserve the value is on the stack, but that means space should be
reserved for it. These directives both ensure that the compiler includes
space for the register in the generated prolog code and actually place
the register’s value in that reserved location. They also ensure that
the function epilog properly restores the register before cleaning up
the local frame. .PUSHNV works with the 64-bit general purpose registers
RAX..R15 and .SAVENV works with the 128-bit XMM0..XMM15 SSE2 registers.
See the calling convention description referenced above for which
registers are considered non-volatile. Even though you can specify any
register, volatile or non-volatile, as a parameter to these directives,
only those registers which are actually non-volatile will be preserved.
For instance, .PUSHNV R11 will assemble just fine, but no changes to the
frame will be made, whereas .PUSHNV R12 will place a PUSH R12
instruction right after the PUSH RBP instruction in the prolog.
The compiler will also continue to ensure that the stack remains
aligned. Remember when I talked about why the stack must remain 16-byte
aligned? One key reason is that many SSE2 instructions which operate on
128-bit memory entities require that the memory access be aligned on a
16-byte boundary. Because the compiler ensures this is the case, the
space reserved by the .SAVENV directive is guaranteed to be 16-byte
aligned.
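To make the R11 versus R12 example concrete, the rule can be modeled in Python as a lookup plus a little alignment arithmetic. This is an illustrative sketch rather than Delphi code. The register sets follow the Windows x64 convention, while the function names is_nonvolatile and padding_after_pushes are hypothetical. The padding calculation assumes a prolog that pushes RBP followed by one PUSH per preserved register; how the Delphi compiler actually folds that adjustment into the frame is not described here.

```python
# Registers the Windows x64 convention treats as non-volatile (callee-saved).
NONVOLATILE_GP = {"RBX", "RBP", "RDI", "RSI", "RSP", "R12", "R13", "R14", "R15"}
NONVOLATILE_XMM = {f"XMM{n}" for n in range(6, 16)}  # XMM6 through XMM15


def is_nonvolatile(reg):
    """True if the callee must hand this register back unchanged."""
    reg = reg.upper()
    return reg in NONVOLATILE_GP or reg in NONVOLATILE_XMM


def padding_after_pushes(pushnv_count):
    """Bytes of padding needed for RSP to land on a 16-byte boundary.

    Counts the return address, PUSH RBP, and one 8-byte PUSH for each
    preserved general purpose register.
    """
    pushed = 8 * (2 + pushnv_count)
    return (-pushed) % 16


if __name__ == "__main__":
    for reg in ("R11", "R12", "XMM5", "XMM6"):
        print(reg, is_nonvolatile(reg))
    print(padding_after_pushes(1))
```

In this model a single preserved register leaves RSP 8 bytes off a 16-byte boundary, so something has to absorb those 8 bytes, while two preserved registers need no padding at all. That bookkeeping is exactly what the directives hand over to the compiler.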
Writing assembler code in the new x64 world can be daunting and
frustrating due to the very strict requirements on stack alignment and
exception meta-data. By using the above directives, you are signaling
your intentions to the one thing that is pretty darn good at ensuring
all those requirements are met: the compiler. You should always ensure
the directives are placed at the top of the assembler function body
before any actual CPU instructions. This makes sure the compiler has all
the information and everything is already calculated by the time it
begins to see the actual CPU instructions and needs to know the offset
from RBP at which a local variable is located. Also, by ensuring that
all stack manipulations happen within the prolog and epilog, the system
will be able to properly “unwind” the stack past a properly written
assembler function. Without this data, the OS unwind process could
become lost and, at best, skip exception handlers or, at worst, call the
wrong one and lead to further corruption. If the unwind process gets
lost enough, the OS may simply kill the process without any warning,
similar to what stack overflows do in 32-bit (and 64-bit).