Efficient Context-Sensitive Pointer Analysis for C Programs

Posted on 2019-08-01 Edited on 2021-05-22 In paper

Introduction

Point analysis promises significant benefits for optimizing and parallelizing compilers. However, the analysis must be efficient without sacrificing the accuracy of results. Moreover, the pointer analysis must handle real C programs.

Our approach is based on the insight that procedures are typically called with the same aliases among their inputs. Our idea is to generate incomplete transfer functions that only cover the input conditions that exists in the program. These incomplete transfer functions are made up of simple partial transfer functions (PTFs) that are only applicable to calling contexts that exhibit certain alias relationships.

Major Concepts

Partial Transfer Functions

int f (int *p, int *q, int *r) {
    *p = *q;
    *q = *r;
}

int x, y, z;
int *x0, *y0, *z0;

int main() {
    x0 = &x;
    y0 = &y;
    z0 = &z;

    if (test1)
        f(&x0, &y0, &z0);
    else if (test2)
        f(&z0, &x0, &y0);
    else
        f(&x0, &y0, &x0)
}

The example function int f(int *, int *, int *) illustrates two points. First, the alias among the inputs to a procedure determine its behavior. Second, even for this simple two-statement procedure the complete summary is fairly complex;

The main idea behind our algorithm is that we need not summarize a procedure for all possible aliases among its inputs. Instead, we only to find summaries that apply to the specific alias patterns that occur in the program.

Design of PTFs

There is a tradeoff between the complexity of the individual PTFs and their applicability. Whenever the analysis encounters another call to the procedure with the same input values, it can reuse the final values recorded in the PTF, which known as memorization.

Since it is not the specific input values but their alias patterns that determine the behavior of a procedure, we use symbolic names called extended parameters to abstract the input values. An extended parameter represents the locations reached through an input pointer at the beginning of a procedure.

Extended parameters’ three important roles:

Extended parameters make up part of the name space of a procedure.
The initial points-to function for a PTF specifies the input domain.
The final points-to function at the procedure exit summarizes the pointer assignments.

Besides the alias, the input domain of a PTF is also defined by the value of function pointers passed in and used in calls within the procedure. Because they determine what code could be executed.

Algorithm Outline

When we begin analyzing a procedure, the initial points-to function is empty and the parameter mapping only records the actual values for the formal parameters. When we need to know the initial value of an input pointer and it is not already recorded, we add an entry to the initial points-to function. To check for aliases, we need to look up the initial values in the calling context. The process continues recursively up the call graph until the pointer values are known.

When the iterative analysis encounters a procedure call, it needs to find the effects of the call on the points-to function. if the call is through a pointer passed as an input to one or more of the PTFs on the call stack, we add the potential values of that pointer to the specifications of the input domains for those PTFs. For each potential callee procedure, we first check if any of its PTFs apply. This involves building up a parameter mapping and comparing the input alias to those recorded for the PTF.

Pointer Representations

Instead of using types to identify locations that may contain pointers, we assume that any memory location could potentially hold a pointer. we also refer to locations within a block of memory by their positions, not by their field names. We define a new abstraction, the location set, to specify both a block of memory and a set of positions within that block.

We divide memory into blocks of contiguous storage, whose positions relative to one another are undefined. A block of memory may be one of three kinds: a local variable, a locally allocated heap block, or an extended parameter (global variables).

We distinguish heap blocks based on their allocation context, grouping together all the blocks allocated in each context. The minimal context information is simply the statement that creates the block.

Location Sets

Our goal with regard to aggregates is to distinguish between different fields within a structure but not the different element of an array. Our pointer analysis can be combined with data dependence tests to distinguish between different array elements.

A location set is a triple (b, f, s) where the base b is the name for a block of memory, f is an offset within that block, and s is the stride. That i, it represents the set of locations {f + is | i ∈ Ƶ} within block b.

Expression	Location Set
scalar	(scalar, 0, 0)
struct.F	(struct, f, 0)
array	(array, 0, 0)
array[i]	(array, 0, s)
array[i].F	(array, f, s)
struct.F[i]	(struct, f % s, s)
*(&p + X)	(p, 0, 1)

Extended Parameters

Every object is represented by at most one extended parameter. When adding an entry to the initial points-to function we first find the values of the pointer in the calling context. If the parameter mapping shows that an existing parameter already includes the same values, we reuse the old parameter instead of creating new one.

For simplicity and efficiency, each initial points-to entry points to a single extended parameter. In cases where the initial values are aliased with more than one parameter, we create a new extended parameter that subsumes all the aliased parameters, and we replace all references to the subsumed parameters. Likewise, when the initial values are aliased with an existing parameter but also include new values, we subsume the aliased parameter with new one.

Aliases involving fields of structures can be a bit more complicated. When updating the initial points-to function, if we find a pointer to a field of an existing parameter, we can easily record that fact. However, if we encounter a pointer to a field before a pointer to the enclosing structure, we will create an extended parameter to represent the field. We simply allow the location set offsets to be negative.

Points-to Functions

A points-to function is a mapping from location sets to sets of location sets. It is important for the analysis to know which locations may contain pointers. Since we do not use the high-level type, and since the points-to function only contain entries for pointers that have been referenced, we record separately for each block of memory the location sets that may hold pointers.

Within the pointer analysis itself, that information can enable strong updates, where assignments overwrite the previous contents of their destinations. We only need that information at the point where a pointer is dereferenced.

Analyzing a Procedure

int EvalProc(proc *pr, PTF *ptf) {
    /* iteratively analyze a procedure */
    do {
        changed = false;
        foreach cfgnode nd in pr.flowgraph {
            if (no predecessors of nd evaluated)
                continue;
            if (nd is meet) EvalMeet(nd, ptf);
            if (nd is assign) EvalAssign(nd, ptf);
            if (nd is call) EvalCall(nd, ptf);
        }
    } while (changed)
}

Strong Updates

Unless we know that an assignment will definitely occur and that it assigns to a unique location, we must conservatively assume that location potentially retains its old values.

Sparse Representation

Since we analyze heap data as well as global and stack variables, many possible memory locations could be included in the points-to functions. Fortunately, the information stored in the point-to function is very sparse.

Because of the sparse point-to function representation, looking up the values of a pointer requires searching back for the most recent assignment to that location. Beginning at the current node, we search back through the dominating flow graph nodes.

Evaluating Dereferences

Because location sets may overlap, more than one location set may refer to the same location. Value assigned to one location set must be observed by references to overlapping locations. Thus, when a pointer is dereferenced, we iterate through all of the overlapping locations and look up in the current points-to function the values that have been assigned to each one. However, if the location being referenced is a unique location, values assigned to overlapping locations may have been overwritten by a strong update.

Evaluating Assignments

An assignment node in a flow graph specifies both the source and destination expressions, as well as the size of the value to be assigned. When building the flow graph from the intermediate representation of a program, we automatically convert the assignments to a “points-to” form. That is, since a variable reference on the right-hand side if an assignment refers to the contents of that variable, we add extra dereference to each expression on the right-hand side.

The process of evaluating an assignment begins by finding the locations identified by the source and destination expressions.

Interprocedural Algorithm

To evaluate the effects of a call on the points-to function, we need to find a PTF that applies in the calling context.

Calls Through Pointers

The first step in evaluating procedure calls is to determine the target procedures. Function pointer values may be expressed in terms of extended parameters. This is one situation where we need to know the specific values represented by extended parameters.

Testing if a PTF Applies

The input domain of a PTF is specified by its initial points-to function and by the value of the parameters used as call targets. To test if a PTF applies, we check if these things are the same in the current context as they were when the PTF was created.

Zhangtong's Notebook