Shining Light on Shadow Stacks

Introduction

Code pointers can be divided into two categories: backward edge, i.e., return address or forward edge pointers, such as function pointers or virtual table pointers. Control-Flow Integrity protects forward edges and assumes that the backward edges are protected. However, stack canaries and safe stacks are the strongest backward edge protection available in mainline compilers, and both are easily bypassed by information leaks.

Shadow stacks enforce stack integrity, protecting against stack pivot attacks and overwriting return addresses. Shadow stacks store the return address in a separate, isolated region of memory that is not accessible by the attacker. Upon returning, the integrity of the program return address is checked against the protected copy on the shadow stack. Two shadow stack designs have been proposed: compact shadow stacks, which rely on a separate shadow stack pointer, and parallel shadow stacks, which place the shadow stack at a constant offset to the orignal stack. These existing shadow stack designs suffer from a combination of poor performance, i.g., greater than 5%, and difficulty supporting C and C++ programming paradigms, i.g., multi-threading and exception handling.

Shadow stack design space

For any shadow stack mechanism to be adopted in practice, it must be highly performance, compatible with existing code, and provide meaningful security.

Shadow stack mechanisms are defined by how they map from the program stack to the shadow stack. This includes the type of mapping, as well as how the mapping is encoded in the protected binary.

Shadow stack mechanisms

Direct mapping schemes for parallel shadow stacks use the location of the return address on the program stack to directly find the corresponding entry on the shadow stack. The parallel shadow stack is as large as the program stack, and a simple offset maps from the program stack to the shadow stack.

Indirect mapping schemes for compact shadow stacks maintain a shadow stack pointer, equivalent to the stack pointer used for the program stack. The shadow stack pointer points to the last entry on the shadow stack, exactly as the stack pointer does for the program stack. Maintaining a shadow stack pointer allows a compact shadow stack to allocate significantly less memory, as only room for the return address is required, instead of duplicating the program stack.

Parallel and compact shadow stacks have different compatibility implications. if calls and returns were always perfectly matched, there would be no difference. However, the setjump / longjump functionality of C, which allows jumping mutiple stack frames back up the stack, and the equivalent stack unwinding capability used by C++ for exception handling, both break the assumption of perfectly matched calls and returns. Consequently, this leads to additional overhead for indirect shadow stack mapping schemes, while having no effect on direct mapping schemes.

Shadow stack mechanisms based on register.

All 64 bit architecture have at least 16 general purpose registers, and it is possible to dedicate a general purpose register to the shadow stack mechanism.

Parallel shadow stack mechanisms: The existing mechanism places shadow stack entries at a constant offset from the program stack, which is very efficient but higher memory overhead and lower security. However, the compatibility concerns arise from requiring a constant offset, which is limited to 32 bits for immediate operand in x86, from the program to the shadow stack from all threads, severely constraining the address space layout for programs with many threads, such as browsers.

To mitigate the compatibility and security concerns, we propose a new parallel shadow stack mechanism, which encodes the offset in a dedicated register, allowing the offset to the shadow stack to be determined at runtime. Further, the offset may vary from thread to thread as register are thread as register are thread local, and the offset can be set when the thread is created.

Compact Shadow Stack Mechanisms: For compact shadow stack mechanisms, the key question is where to store the shadow stack pointer. The shadow stack pointer will be dereferenced twice in every function: once in the prologue to push the correct return address, and once in the epilogue to pop the shadow return address.

The earliest approach of a shadow stack scheme with a dedicated register focuses on x86_32. We rejuvenate this idea for 64 bit architectures as gerneral purpose register provide the fastest possible option for storing the shadow stack pointer.

Shadow stack implementations

Each of the shadow stack mechanisms we evaluate is implemented as a backend compiler pass in LLVM 7.0.0. In particular, each shadow stack mechanism must instrument calls and returns to update its shadow stack and validate the return address before using it to transfer control. We show that the best way to accomplish this is to instrument function prologues and epilogues.

Instrumented locations

The instrumentation is responsible for pushing the return address to the shadow stack, and updating the shadow stack pointer for compact shadow stacks. Returns must be instrumented to pop from the shadow stack and validate the program return address in the function epilogue before the control-flow transfer to mitigate control-flow hijacking attacks.

The elegant solution for instrumenting calls is to place the protection in the function prologue. In this way, the function is protected, not particular call sites. instrumenting function prologues and epilogues maintains the symmetry of calls and returns naturally, as each will be executed for every function call.

On x86, instrumenting the function prologues results in a one-instruction wide Time Of Check To Time Of Use (TOCTTOU) opportunity due to architecture limitations. The call instruction pushes the return address to the stack where it may be modified by an attacker before it is picked up by the prologue in the called function. Architectures, such as ARM, where the address of the called function is stored in a register, do not have this limitation.

Our proposed mechanism halves the attack window as we jump to the verified address, but not fully mitigate the TOCTTOU window. Intel Control Enforcement Technology (CET) introduces a shadow stack based on hardare and compiler support. This extension will mitigate the TOCTTOU window on x86 and simplify the required instrumentation.

Stack unwinding mechanisms such as longjmp and c++ exceptions require additional instrumentation for compact shadow stacks. We must be able to unwind to the correct point on the shadow stack as well. Simply matching return addresses does not suffice for this, as the same address can show up multiple times in the call stack due to, e.g., recursive calls. To deal with this, our compact shadow stack implementations also push the stack pointer, i.g., rsp. The stack pointer and return address uniquely identify the stack frame to unwind to, allowing our mechanisms to support stack unwinding.

For the shadow stack mechanisms that use a register to encode the shadow stack mapping, ensuring compatibility with unprotected code constrains our selection of register. Our implementations use r15 in practice.

Runtime support

Our runtime library is responsible for allocating the shadow stack, and hooking setjump and longjmp. We add a new function in the pre_init array that initializes the shadow stack for the main program thread. This function also initializes the shadow stack pointer for compact shadow stack mappings.

For compact shadow stack mappings to support multi-threading and libunwind, we preload a small support library. It intercepts calls to pthread_create and pthread_exit to set up and tear down shadow stacks for additional threads.

Shadow stack epilogue optimizations

Traditionally, shadow stacks have relied on compare instructions to validate the shadow return address and program return address are equivalent, which potentially leads to pipe line stalls even with branch prediction. Consequently, as an optimization, we explore two different methods to optimize this validation.

To replace the compare instruction, we propose an xor of the program return address and shadow return address. This will result in 0 bits anywhere the two are identical, and 1s elsewhere. x86 has an instruction, popcnt, that returns the number of bits set to 1. Consequently, if the popcnt of the xor of the program return address and shadow return address is 0, then two are equivalent.

We leverage the memory managment unit (MMU) to compare the popcnt to zero as a side effect by creating a protection fault: triggering a page fault.

  1. By shifting this value left 48 and or it into the return address, we create a general purpose fault for a non-canonical address from if its value is not zero.

  2. The Last Byte in Page (LBP) scheme create two pages in memory, the first of which is mapped read write, the second of which has no permissions. We then attempt to read from the first page at the address of the last valid byte, plus the popcnt value. If the popcnt value is zero, we read the last byte of the valid page, otherwise we read from the guard page, causing the MPU to return a fault.

Hardware integrity mechanisms

Once a shadow stack design has been chosen, the shadow stack mechanism must guarantee the integrity of the shadow stack. Integrity guarantees are best provided by hardware solutions, which offer greater security and performance than software solutions, and can be as generic.

Existing hardware mechanisms take two different approaches to encoding access privileges to provide integrity protection. (i) MPK which encodes access privileges in each thread’s register file, providing per thread integrity, and (ii) MPX which encodes access in the individual instructions, so that access privileges are the same across all threads and depend only on the execution instruction

Thread centric solutions

Thread centric solutions operate by changing the permissions on the page of the protected memory region. Adding write permissions elevates the thread’s privileges, thereby creating a privileged region that is able to modify the protected memory region, i.e., shadow stack.

Memory Protection Keys (MPK) aims to address this by providing a single, unprivileged instruction that can change page access permissions on a per thread basis.

1
2
3
4
5
6
7
8
9
10
11
12
13
# Read Write (disable all MPKs)
mov $0, %eax
xor %ecx, %ecx
xor %edx, %edx
wrpkru

# protection is off, write to the shadow stack
....
# Read Only (enable write disable bit for shadow stack)
mov $8, %eax
xor %ecx, %ecx
xor %edx, %edx
wrpkru

Note that the wrpkru instructions requires edx and ecx to be set to 0. Consequently, for functions which take more that two arguments, it is necessary to preserve the original values of these registers.

Code centric solutions

The Intel ISA extension Memory Protection Extension (MPX) provides a hardware mechanism that can be used to implement segmentation in a flexible manner. MPX segmentation schemes divide writes into two categories, those that are privileged to write into the protected region, and all others. This approach is surprisingly performant.

Evaluation

We believe that Shadesmar, a compact, register based shadow stack that directly uses the shadow RA, and relies on information hiding to protect the shadow stack, is the best candidate for adoption by mainline compilers based on our initial experiments with SPEC CPU 2006.