Enforcing kernel security invariants with data-flow integrity


More recently, researchers have demonstrated the feasibility of enforcing Control-Flow Integrity (CFI) in kernel space. However, in addition to problems discovered in CFI, a more fundamental problem is that because OS kernel are mainly data-driven, CFI can be easily by passed by non-control-data attacks.

The objective of this work is to provide a defense system that is both principled and practical. We achieve the first goal by utilizing Data-Flow Integrity. Similar to CFI, DFI guarantees that runtime data-flow cannot deviate from the data-flow graph generated from static analysis. For example, data from a string buffer should never flow to the return address on the stack (control-data) or the uid (non-control-data). In this work, we focus on enforcing invariants that are related to kernel access control mechanisms (a.k.a., reference monitors) so as to defeat privilege escalation attacks.

While this approach sounds intuitive at high level, enforcing it with practical runtime performance overhead is very challenging. First, access control checks are scattered through out the kernel, and related data are mixed with other data. Therefore, the protection must be deployed kernel-wide. Moreover, without hardware support, software-based DFI implementation can be every expensive.

We propose a system called KENALI, which consists of two key techniques. Our first technique, INFERDISTS, is based on the observation that although the protection has to be kernel-wide, only a small portion of data is essential for enforcing the two security invariants. For ease of discussion, we refer to this set of data as distinguishing regions. Our technique, INFERDISTS, overcomes these challenges with a new program-analysis-based approach. Specifically, by leveraging implicit program semantics, INFERDISTS is able to infer security checks with out any manual annotation. After this, by considering both data and control-dependencies of data that can affect each security check, as well as sensitive pointers, INFERDISTS generates a complete set of distinguishing regions.

Our second technique, PROTECTDISTS, is a new technique to enforce DFI over the distinguishing regions. PROTECTDISTS uses a two-layer protection scheme. The first layer provides a coarse-grained but low overhead data-flow isolation that prevents illegal data-flow from non-distinguishing regions to distinguishing regions. After this separation, the second layer the enforce fine-grained DFI over the distinguishing regions. Because the access pattern to most data in the distinguishing regions is very asymmetric, read accesses are usually magnitudes more frequent than write accesses, PROTECTDISTS employs the Write Integrity Test (WIT).

Problem scope

Threat model and assumptions

We only consider attacks that originate from unprivileged code. The attacker can read and write word-size value at an arbitrary memory address.


  1. Simple rooting attacks. CVE-2013-6282 allows attacker to read and write arbitrary kernel memory, which matches our adversary model. (i) Retrieving the address of prepare_kernel_cred() and commit_creds(). Depending on the target system, they can be at fixed address, or obtainable from the kernel symbol table (kallsyms_addresses); (ii) Invoking prepare_kernel_cred() and pass the results to commit_creds(), then the kernel will replace the credential of the current thread with one of root privilege.

  2. Bypassing CFI with non-control-data attacks. The above attack can be prevented by kernel wide CFI. But CFI can be easily bypassed by locating the cred structure and overwriting the euid field. The cred structure can be located in many ways. (i) if kallsyms is available and contains the address of init_task, we can traverse the process list to locate the task_struct of the current process, then task_struct->cred (ii) if there is a vulnerability that leaks the stack address, attackers can directly obtain the address of the thread_info, then follows the links to locate the task_struct.

  3. Diversity of non-control-data attacks. From the target kernel we evaluated, we found that 2419 data structures contain critical data.

Technical approach

Inferring distinguishing regions

The challenge is that for security, our solution must be sound (i.e., no false positives), but for performance, we want the size of the inference result to be as small as possible.

  1. Control-Data

  2. Non-Control-Data: Access controls are implemented as security checks, and while a kernel may have many security checks scattered throughout different components, they all follow one consistent semantic: if a security check fails, it should return a security related error codes. Leveraging this observation, we can collect security check without manual annotation. Then, distinguishing regions can be constructed via standard dependency analysis over the conditional variables of security check.

int acl_permission_check (struct inode * inode, int mask) {
unsigned int mode = inode -> i_mode;

if (current_cred -> fsuid == inode -> i_uid)
mode >>= 6;
else if (in_group_p (inode -> i_gid))
mode >>= 3;

if ((mask & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) == 0)
return 0;

return -EACCES;

We must consider branches that dominate a security check. However, naively including all dominators will introduce many false positives. To reduce false positives, INFERDISTS conservatively excludes two cases: (i) a branch can lead to non-related error return and (ii) a branch instruction is post-dominated by either a security check or checks in (a).

  1. Sensitive pointers: pointers to sensitive regions must be protected as well; otherwise, attackers can indirectly control the data in distinguishing regions by manipulating the pointers.

Protecting distinguishing regions

The challenge for this step is how to minimize the performance overhead on commodity processors that lack support for fine-grained data-flow tracking. To address this challenge, our key observation is that, after separating the memory into distinguishing region and non-distinguishing region, there could be three types of data-flow: (i) within non-distinguishing region, (ii) between two regions, and (iii) within distinguishing regions.

We design a two-layer scheme: lightweight data-flow isolation on second type, and more expensive DFI enforcement to prevent illegal data-flow of the third type.

  1. Data-flow isolation: We explored the feasibility of hardware-based data-flow isolation for AArch64. We developed a novel, virtual address space-based isolation mechanism by tagging the TLB with an identifier (i.e., Process Context ID)

  2. Write Integrity Test: In addition to preventing illegal data-flow from non-distinguishing regions to distinguishing regions, we use DFI to prevent illegal data-flow within distinguishing regions. We leveraged the write integrity test.

  3. Shadow Objects: As a hardware protection unit (e.g., page) may contain both distinguishing and non-distinguishing regions, once we write-protect that page, we also have to pay additional overhead for accessing non-distinguishing regions. Our solution to this problem is shadow objects: a normal copy for non-distinguishing regions and a shadow copy for distinguishing regions.

  4. Safe Stack: Using shadow stack.

A prototype for Android

Data-flow isolation

  1. AArch64 VMSA (Virtual Memory System Architecture): The AArch64 supports a maximum of 48-bit virtual address (VA), which is split into two parts: the bottom part is for user space, and the top part is for kernel space.

  2. Shadow Address Space: Our data-flow isolation implementation leverages the ASID (Address Space IDentifier) tagging feature and is based on shadow address space. Under this isolation scheme, the same physical page is mapped into two different address spaces with different permissions.

  3. Atomic Primitive Operations: Considering the security of the shadow address space, we make every operation under the shadow address space atomic, such as write a single data, memcpy and memset.

MMU integrity

Three additional security invariants: (i) MMU isolation: Only MMU management code can modify MMU-related data. (ii) Code integrity: Kernel code integrity. (iii) Dedicated entry and exit: MMU management code is always invoked through dedicated entries and exits.

Shadow objects

  1. SLUB Allocator: In our target kernel, most distinguishing regions are allocated from the SLUB allocator. When SLUB allocates pages for a new slab, we allocate a shadow slab of the same size and map it read-ony at a fixed offset (4GB) from the original slab.

  2. Global Objects: While most distinguishing regions are dynamically allocated, some of them are statically allocated in the form of global objects. We allocate shadow memory for entire .data section, copy all the contents to populate the shadow memory and then map it.

  3. Analysis and instrumentation: We first identify all allocation sites for objects in distinguishing regions and modify the allocation flag. Next, we identify all pointer arithmetic operations for accessing distinguishing regions and modify them to access the shadow objects. Finally, we identify all write to distinguishing regions, and modify them to invoke our atomic operations instead. inline assembly needs to be handled manually.

Kernel stack randomization

We map kernel stack to a unused AV above the logical map (top 256GB), because kernel stacks are small (16KB), we have around 24-bit of entropy available.

We contain the risk of information leak as follows. First, we mark functions like copy_to_user as unsafe. Then, we store the real stack pointer in a redirection table and replace the original pointer with an index into the table, which mapped as inaccessible under normal context. So does the page tables of shadow stack.


Distinguishing regions discovery

For control data, our analysis identified 6192 code pointers, 991 are function args, 11 are return addresses, both of them are protected by safe stack. 1490 are global variables, and 3699 are fields over 783 data structures.

For non-control data, the error codes we used were EPERM, EACCES, and EROFS. 526 functions as capable of returning permission-related errors; 1077 function args, 279 global variable, and 1731 data fields over 855 data structures as security-critical.

For sensitive pointers, we combined both control and non-control data to result a total of 4096 data fields over 1316 data structures as the sensitive pointer inference. Futher, there are 4002 fields over 1103 structures as distinguishing regions. For the target kernel, we should protect about 27.30% of all data structures.