Shuffler: Fast and Deployable Continuous Code Re-Randomization

Posted on 2019-04-29 Edited on 2021-05-22 In paper

Introduction

We propose a system, called Shuffler, which provides a deployable defense against JIT-ROP and other code reuse attacks. Other defenses have had significant barriers to deployment: some utilize a custom hypervisor; others involve a modified compiler, runtime, or operating system kernel. In comparison, Shuffler runs in userspace alongside the target program, and requires no system modification beyond a minimal patch to the loader.

Shuffler operates by performing continuous code re-randomization at runtime, within the same address space as the program it defends. Additionally, we bootstrap into a self-hosted and self-modifying egalitarian environment: Shuffler always shuffles itself.

We achieve a shuffle period on the order of tens of milliseconds, so fast that it is nearly impossible to form a complete exploit. Shuffler creates new function permutations asynchronously in a separate thread, and then atomically migrates program execution from one copy of code to the next. This migration requires a vanishingly small global pause time, as program threads continue to execute unhindered 99.7% of the time. Thus, if the host machine has a spare CPU core, shuffling at faster rates does not significantly impact the target’s performance.

Our system operates on program binaries, analyzing them and performing binary rewriting.

Threat model

We assume that the protection against code injection (W^X) is in place, and that an x86_64 architecture is in use. Our system does not require (and, in fact, is orthogonal to) other defensive techniques.

Design

Architecture

Shuffler is designed to require minimal system modifications. To avoid kernel changes, it runs entirely in userspace; to avoid requiring source or a modified compiler, it operates on program binaries. Performing re-randomization soundly requires complete and precise pointer analysis, so we leverage symbol and relocation information from the (unmodified) compiler and linker.

At load-time, Shuffler transforms the program’s code using binary rewriting. The goal of rewriting is to be able to track and update all code pointers at runtime. We leverage our complete and accurate disassembly to transform all code pointers into unique identifiers, namely indices into a code pointer table. These indices cannot be altered after load time. We handle return addresses (dynamically generated code pointers) differently, encrypting them on the stack rather than using indices.

Our system performs re-randomization at the level of functions within a specific shuffle period, a randomization deadline specified in milliseconds. Shuffler runs in a separate thread and prepares a new shuffled copy of code within this deadline. The vast majority of the re-randomization process is performed asynchronously: creating new copies of code, fixing up instruction displacements, and updating pointers in the code table. The threads are globally paused only to atomically update return addresses. Since any existing return addresses reference the old copy of code, we must revisit saved stack frames and update them.
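
The return-address update can be pictured with a small sketch. It is not Shuffler's implementation: the layout tables, the function names, and the assumption that a return address keeps its offset inside its function when the function is copied are all illustrative.

```python
import bisect


def translate_return_address(addr, old_layout, new_starts):
    """Map a return address from the old code copy to the new one.

    old_layout: list of (start, size, name) sorted by start (hypothetical format).
    new_starts: dict mapping function name to its start in the new copy.
    Assumes the offset inside the function is unchanged by the copy.
    """
    starts = [start for start, _, _ in old_layout]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        raise ValueError(f"{addr:#x} is below every known function")
    start, size, name = old_layout[i]
    if not start <= addr < start + size:
        raise ValueError(f"{addr:#x} is not inside a known function")
    return new_starts[name] + (addr - start)


# Hypothetical old and new layouts for two functions.
old_layout = [(0x1000, 0x80, "main"), (0x1080, 0x40, "helper")]
new_starts = {"main": 0x9000, "helper": 0x5000}

saved_return_addresses = [0x1010, 0x1090]  # as found while unwinding one stack
updated = [
    translate_return_address(a, old_layout, new_starts)
    for a in saved_return_addresses
]
```

In the real system this step runs while threads are paused, which is why it is kept as small as possible.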

To prevent our own code from being used in a code reuse attack, Shuffler randomizes it the same way it does all other code. In fact, our scheme uses binary rewriting to transform all code in a userspace application (the program, Shuffler, and all shared libraries) into a single code sandbox, essentially turning it into a statically linked application at runtime.

Challenges

Changing function pointer behavior. A normal program’s memory layout remains consistent, and its function pointers have an indefinite lifetime. Re-randomization introduces an arbitrary lifetime for each block of code, so it becomes an exercise in avoiding dangling code pointers.

Hence, we need to accurately track and update every code pointer during the re-randomization process. We opt to statically transform all code pointers into unique identifiers—namely, indices into a hidden code pointer table. Then wherever the code pointer is copied throughout memory, it will continue to refer to the same entry in the table.
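
A toy model may make the indirection concrete. The class name, the addresses, and the way new addresses are drawn are assumptions for illustration only; the real table lives behind a segment register and holds machine addresses.

```python
import random


class CodePointerTable:
    """Toy stand-in for the hidden code pointer table (illustrative names)."""

    def __init__(self, function_addresses):
        # Entry i holds the current address of function i.
        self._entries = list(function_addresses)

    def resolve(self, index):
        # An indirect call looks up the current address at call time.
        return self._entries[index]

    def shuffle(self, rng, base=0x10000, span=0x100000):
        # Every function moves to a fresh address; the indices never change.
        candidates = range(base, base + span, 0x100)
        self._entries = rng.sample(candidates, len(self._entries))


table = CodePointerTable([0x401000, 0x401200, 0x401400])

callback = 1                 # the program stores an index, not an address
copied_callback = callback   # copies need no tracking

before = table.resolve(callback)
table.shuffle(random.Random(0))
after = table.resolve(copied_callback)
```

Only the table entries are rewritten on each shuffle, so the cost of an update does not depend on how many copies of a pointer exist in program memory.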

Some code pointers are dynamically generated, in particular, return addresses on the stack. We could dynamically allocate table indices, but call/ret pairs are highly optimized, and replacing them with a table mechanism would involve a large performance degradation. Instead, we allow ordinary calls to proceed as usual, and at re-randomization time we unwind the stack and update return addresses to new values. Rather than leave return addresses exposed on the stack, we encrypt each address with an XOR cipher.

Augmented binary analysis. We propose augmented binary analysis, which involves analyzing program binaries that have additional information included by the compiler.

The common problems with binary analysis are distinguishing code from data, and distinguishing pointers from integers. To tackle these problems, we require that (i) the compiler preserve the symbol table, and (ii) the linker preserve relocations. The symbol table indicates all valid call targets and makes disassembly straightforward: we iterate through symbols and disassemble each one independently. Relocations are used to indicate portions of an object file (or executable) that need to be patched up once its base address is known. Since each base address is initially zero, every absolute code pointer must have a relocation. However, as object files are linked together, most code pointers get resolved and their relocations are discarded. We simply ask the linker to preserve these relocations.

Bootstrapping into shuffled code. Shuffler defends its own code the same way it defends all other code. Shuffled code cannot start running until the code pointer table is initialized, requiring some unshuffled startup code. Shuffled and original code are incompatible if they use code pointers; the process of transforming code pointers to indices overwrites data that the original code accesses, and then the original code will no longer execute correctly. Hence, we would have to call new functions as they became available, and carefully order the function-pointer rewrite process to avoid invalidating any functions currently on the call stack.

Instead, we opted for a simpler and more general solution. Shuffler is split into two stages, a minimal stage and a runtime stage. The minimal stage is completely self-contained, and it can safely transform all other code, including libc and the second-stage Shuffler. Then it jumps to the shuffled second stage, which erases the previous stage (and all other original code). The second stage inherits all the data structures created in the first so that it can easily create new shuffled code copies.

Implementation

Code pointers are directed through the code pointer table, and return addresses are stored on the stack, encrypted with an XOR cipher. In each shuffle period, Shuffler makes a new copy of code, updates the code pointer table, and sends a signal to all threads (including itself); each thread unwinds and fixes up its stack. Shuffler waits on a barrier until all threads have finished unwinding, then erases the previous code copy.
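
The ordering of one shuffle period can be sketched with ordinary threads and a barrier. This is only a model of the sequence described above: starting a Python thread stands in for delivering a signal, and the log messages stand in for the real work. In Shuffler the Shuffler thread also fixes up its own stack, which the sketch omits.

```python
import threading


def run_one_shuffle(num_threads):
    """Simulate one shuffle period and return the ordered event log."""
    log = []
    log_lock = threading.Lock()
    # Every program thread plus the shuffler thread meets at the barrier.
    barrier = threading.Barrier(num_threads + 1)

    def on_signal(tid):
        # Stand-in for the signal handler: unwind and fix up this stack.
        with log_lock:
            log.append(f"thread {tid} fixed its stack")
        barrier.wait()

    with log_lock:
        log.append("new code copy ready")

    threads = [
        threading.Thread(target=on_signal, args=(tid,))
        for tid in range(num_threads)
    ]
    for t in threads:
        t.start()  # stand-in for sending the signal

    barrier.wait()  # released only after every thread has fixed its stack
    with log_lock:
        log.append("old code copy erased")

    for t in threads:
        t.join()
    return log


log = run_one_shuffle(3)
```

The barrier is what makes erasing the old copy safe: no thread can still hold a return address into it once the barrier releases.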

Our Shuffler implementation supports many system-level features, including shared libraries, multiple threads, forking, {set/long}jmp, system call re-entry, and signals.

Transformation to support shuffling

Code pointer abstraction. We allocate the code pointer table at load-time and set the base address of the GS segment to point at it. Then, we transform every function pointer at its initialization point from an address value to an index into this table. Jump tables are handled similarly, with indices assigned to each offset within a function that is used as a target.

Every instruction which originally used a function pointer value is rewritten to instead indirect through the %gs table. This adds an extra memory dereference. Since an x86 instruction can contain at most one memory reference, if there is already a memory reference, we use the caller-saved register %r11 as scratch space. For (position-dependent) jump tables, there is no register we can safely overwrite, so we use a thread-local variable allocated by Shuffler as scratch space (denoted as %fs:0x88).

Return address encryption. We encrypt return addresses on the stack with a per-thread XOR key. We reuse the stack canary storage location for our key; our scheme operates similarly to stack canaries, but does not affect the layout of the stack frame. We add two instructions, mov %fs:0x28, %r11; xor %r11, (%rsp), at the beginning of every function and before every exit jump; after each call, we insert a mov instruction to erase the now-visible return address on the stack. We again use %r11 as a scratch register, since it is a caller-saved register according to the x86-64 ABI.
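
The same XOR is used to encrypt on entry and decrypt before exit, because XOR with a fixed key is its own inverse. A minimal sketch, with a made-up key and return address:

```python
MASK64 = (1 << 64) - 1  # stack slots are 64 bits wide on x86-64


def xor_with_key(slot_value, key):
    """Model of `xor %r11, (%rsp)`: XOR the stack slot with the key."""
    return (slot_value ^ key) & MASK64


thread_key = 0x5DEECE66D1234567      # hypothetical per-thread key
return_address = 0x0000000000401234  # hypothetical return address

on_stack = xor_with_key(return_address, thread_key)  # at function entry
restored = xor_with_key(on_stack, thread_key)        # before the exit jump
```

An attacker who reads the stack sees only on_stack, which reveals the code address only if the key is also known.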

Displacement reach. A normal call instruction has a 32-bit displacement and must be within ± 2GB of its target to “reach” it. Shared libraries use Procedure Linkage Table trampolines to jump anywhere in the 64-bit address space.
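
The reach condition is a signed 32-bit range check on the distance between the end of the call instruction and its target. The helper name below is an illustration, not part of Shuffler:

```python
def fits_rel32(next_instruction_address, target_address):
    """Return True if a call with a 32-bit displacement can reach the target.

    The displacement is relative to the address of the instruction
    that follows the call.
    """
    displacement = target_address - next_instruction_address
    return -(1 << 31) <= displacement < (1 << 31)


site = 0x400000
near_ok = fits_rel32(site, site + (1 << 31) - 1)  # largest forward reach
too_far = fits_rel32(site, site + (1 << 31))      # one byte beyond
backward_ok = fits_rel32(1 << 32, (1 << 32) - (1 << 31))
```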

Completeness of disassembly

While shuffling some libraries and programs, we encountered myriad special cases. The issues boil down to: (a) dealing with inaccurate/missing metadata, especially in the symbol table; (b) handling special types of symbols and relocations; and (c) discovering jump table entries and invocations.

Our major challenge is identifying whether relocations are part of jump tables, and distinguishing between indirect tail-recursive jumps and jump-table jumps. If we fail to recognize that a relocation belongs to a jump table, we will calculate its target incorrectly and the jump will branch to the wrong location; if we decide that a jump table’s jump is actually tail recursive, we will insert return-address decryption instructions before it, corrupting %r11 and scrambling the top of the stack.

GCC generates jump tables differently in position-dependent and position-independent code (PIC). Position-dependent jump tables use 8-byte direct pointers, and are nearly always invoked by an instruction of the form jmpq *(%rax, %rbx, 8) at any optimization level. PIC jump tables use 4-byte relative offsets added to the address of the beginning of the table, and the lea that loads the table address may be quite distant from the final indirect jump. To find PIC jump tables, we use outgoing %rip-relative references from functions as bounds and check if they point at sequences of relocations in the data section.
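
The two encodings decode differently, as the following sketch shows. The table contents and addresses are invented; only the entry widths and the "offset plus table address" rule come from the description above.

```python
import struct


def absolute_jump_table_targets(table_bytes):
    """Position-dependent table: 8-byte little-endian direct pointers."""
    count = len(table_bytes) // 8
    return list(struct.unpack(f"<{count}Q", table_bytes[: count * 8]))


def pic_jump_table_targets(table_bytes, table_address):
    """PIC table: signed 4-byte offsets relative to the start of the table."""
    count = len(table_bytes) // 4
    offsets = struct.unpack(f"<{count}i", table_bytes[: count * 4])
    return [table_address + offset for offset in offsets]


# Hypothetical PIC table located at 0x402000 with three case labels.
pic_table = struct.pack("<3i", -0x200, -0x1F0, -0x1C8)
pic_targets = pic_jump_table_targets(pic_table, 0x402000)

# Hypothetical position-dependent table with the same three targets.
abs_table = struct.pack("<3Q", 0x401E00, 0x401E10, 0x401E38)
abs_targets = absolute_jump_table_targets(abs_table)
```

Decoding a PIC table as if it held absolute pointers (or the reverse) yields nonsense targets, which is why misclassifying a table makes the jump branch to the wrong location.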

It is difficult to tell whether a jmpq *%rax instruction is used for indirect tail recursion or for a PIC jump table. We use a linear sweep to record push instructions in the function’s first basic block, and keep a log of the pop instructions seen since the last jump. If an indirect jump is preceded by pop instructions that are in the reverse order of the push instructions, we assume we have found a function epilogue and that the jump is indirect tail recursive.
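
The heuristic can be sketched over a simplified instruction list. The tuple format is an assumption, only jumps are treated as ending a basic block, and a function with no recorded pushes is classified as a jump table user; the real analysis works on disassembled machine code and may decide these cases differently.

```python
def classify_indirect_jumps(instructions):
    """Classify each indirect jump in one function.

    instructions: list of (mnemonic, operand) pairs in AT&T style
    (hypothetical format). Returns a list of (index, label) pairs.
    """
    pushes = []            # pushes seen in the first basic block
    pops = []              # pops seen since the last jump
    in_first_block = True
    results = []
    for index, (mnemonic, operand) in enumerate(instructions):
        if mnemonic == "push" and in_first_block:
            pushes.append(operand)
        elif mnemonic == "pop":
            pops.append(operand)
        elif mnemonic.startswith("j"):
            in_first_block = False  # simplification: any jump ends the block
            if mnemonic in ("jmp", "jmpq") and operand.startswith("*"):
                is_epilogue = bool(pushes) and pops == pushes[::-1]
                label = "tail recursive" if is_epilogue else "jump table"
                results.append((index, label))
            pops = []
    return results


function_body = [
    ("push", "%rbp"),
    ("push", "%rbx"),
    ("cmp", "$4, %edi"),
    ("ja", ".Ldefault"),
    ("jmpq", "*%rax"),   # no pops before it: jump table dispatch
    ("pop", "%rbx"),
    ("pop", "%rbp"),
    ("jmpq", "*%rax"),   # pops mirror the pushes: epilogue
]
classified = classify_indirect_jumps(function_body)
```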

Bootstrapping and requirements

We carefully bootstrap into shuffled code using two libraries (stage 1 and stage 2), so that the system never overwrites code pointers for the module that is currently executing. The constructor of stage 1 is called before any other via the linker mechanism -z initfirst. Then, stage 1 makes sure all other constructors run in shuffled code. The last constructor to be called is stage 2’s own constructor; stage 2 creates a dedicated Shuffler thread.

Compiler flags. We require the program binary and all dependent libraries to be compiled with -Wl,-q, a link flag that preserves relocations, and with -gdwarf-2 when compiling C++. Since we require symbols and DWARF unwind information, the user must avoid -s and -fno-asynchronous-unwind-tables.

System modifications. The -z initfirst loader feature currently only supports one shared library, and libpthread already uses it. This limitation is the reason for the minimal loader patch mentioned in the introduction. Since shuffled functions must be within ± 2GB of each other, we simplify Shuffler’s task and map all ELF PT_LOAD sections into the lower 32 bits of the address space. Finally, we disabled a manually-constructed jump table in the vfprintf of glibc.

Implementation Optimizations

Generating new code. The Shuffler thread maintains a large code sandbox that stores shuffled functions. In each shuffle period, every function within the sandbox is duplicated and the old copies are erased. The sandbox is split in half, so that one half may be erased with a single mprotect system call. We maintain several buckets and each function is placed in a random bucket; when a bucket fills up, it is committed with an mprotect call and a fresh bucket is allocated.

We use a Binary Indexed Tree for function allocations. Our tree keeps track of all valid addresses for new buckets, storing disjoint intervals; it also tracks the sum of interval lengths.
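
One plausible use of the tracked sum is to pick a uniformly random free address in logarithmic time. The sketch below is a generic Binary Indexed (Fenwick) Tree over interval lengths, not Shuffler's data structure; the interval list and the pick_random_address helper are assumptions.

```python
import random


class FenwickTree:
    """Binary Indexed Tree over non-negative interval lengths."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for index, value in enumerate(values):
            self.add(index, value)

    def add(self, index, delta):
        i = index + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, count):
        """Sum of the first `count` values."""
        total, i = 0, count
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def find(self, target):
        """Return (index, offset) of the interval containing unit `target`."""
        pos, remaining = 0, target
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] <= remaining:
                pos, remaining = nxt, remaining - self.tree[nxt]
            step >>= 1
        return pos, remaining


# Hypothetical disjoint free intervals: (start address, length).
intervals = [(0x1000, 0x100), (0x3000, 0x40), (0x8000, 0x200)]
free = FenwickTree([length for _, length in intervals])


def pick_random_address(rng):
    total = free.prefix_sum(free.n)
    index, offset = free.find(rng.randrange(total))
    return intervals[index][0] + offset


address = pick_random_address(random.Random(1))
```

Because the choice is weighted by interval length, every free byte is equally likely to be selected, regardless of how fragmented the free space is.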

Stack unwinding. We wrote a custom unwind library with a straightforward DWARF state machine.

Binary rewriting. Shuffler’s load-time transformations are all implemented through binary rewriting. We disassemble each function with diStorm and produce intermediate data structures which we call rewrite blocks. Rewrite blocks are similar to basic blocks but may be split at arbitrary points to accommodate newly inserted instructions.

Security analysis

Analysis of traditional attacks

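Normal ROP. Shuffler defeats it outright: a conventional ROP chain depends on gadget addresses known in advance, and every code address changes with each shuffle period.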

Indirect JIT-ROP. Indirect JIT-ROP relies on leaked code pointers and computes gadgets accordingly. Because code pointers are replaced with table indices, the attacker cannot gather code pointers from data structures; nor can the attacker infer code pointers from data pointers, since the relative offset between code and data sections changes continuously.

Direct JIT-ROP. In direct JIT-ROP, the attacker is assumed to know one valid code address, and employs a memory disclosure recursively, harvesting code pages and finding enough gadgets for a ROP attack. The entire attack must be completed within the shuffle period of r milliseconds.

Blind ROP. Blind ROP tries to infer the layout of a server process by probing its workers, which are forked from the parent and have the same layout. The attack uses a timing channel to infer information about the parent based on whether the child crashed or not. Shuffler easily thwarts this attack because it randomizes child and parent processes independently.