Hardekopf & Lin: Sparse Flow-Sensitive Pointer Analysis
Background
Pointer analysis determines which pointers (or heap references) may refer to which memory locations. Alias analysis is the related question: whether two expressions can refer to the same location. Together these underpin many security analyses (e.g., detecting use-after-free or taint flows).
There are several dimensions of precision:
- Flow sensitivity: respects statement order, distinguishing what holds "before" vs "after" each statement rather than merging all effects; this reduces spurious aliasing (see the sketch after this list).
- Context sensitivity: distinguishes different call sites of the same function so information doesn’t bleed across unrelated invocations.
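To make the flow-sensitivity distinction concrete, here is a minimal sketch (ours, not the paper's) comparing the two styles on a two-statement program; the tiny IR and all names are invented for illustration.

```python
# Minimal sketch (invented example): points-to sets for a straight-line
# program, computed flow-insensitively vs. flow-sensitively.
# IR: each statement assigns pointer `lhs` the address of object `obj`.
program = [
    ("p", "a"),   # p = &a
    ("p", "b"),   # p = &b
]

# Flow-insensitive: one points-to set per variable, all statements merged.
flow_insensitive = {}
for lhs, obj in program:
    flow_insensitive.setdefault(lhs, set()).add(obj)

# Flow-sensitive: a separate points-to map after each statement; a fresh
# assignment strongly updates (kills) the previous binding.
states, state = [], {}
for lhs, obj in program:
    state = dict(state)
    state[lhs] = {obj}
    states.append(state)

print(flow_insensitive)   # {'p': {'a', 'b'}} -- spurious p -> a survives
print(states[-1])         # {'p': {'b'}}      -- precise at the end
```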
The core challenge: a fully flow-sensitive, precise analysis is expensive over large codebases. Hardekopf & Lin's algorithm achieves full flow sensitivity efficiently by combining:
- A cheap, over-approximating flow-insensitive bootstrap (AUX).
- A sparse, versioned representation of memory and pointer operations (full SSA for pointers, including heap).
- A single flow-sensitive fixpoint solve over that sparse def-use graph (DUG).
High-level pipeline
In sequence:
- Flow-insensitive AUX pointer analysis: conservatively approximate which loads and stores might interact; resolve indirect calls; seed the interprocedural CFG.
- Memory/pointer SSA & partitioning: build SSA form for all pointer variables (including address-taken/heap), partition heap variables into equivalence classes, and insert φ/χ/μ to version heap updates and merges.
- Construct the sparse Def-Use Graph (DUG): keep only relevant operations (ALLOC, COPY, LOAD, STORE, φ/χ/μ) and connect them with labeled/unlabeled edges encoding dataflow.
- Flow-sensitive points-to solve: propagate precise points-to sets over the DUG until a fixpoint is reached.
- Optional refinement: feed tighter results back to refine call graph / SSA and rerun for marginal gains.
1. Preprocessing — Building the Sparse Def-Use Graph (DUG)
1.1 AUX pass (flow-insensitive)
- Quickly computes a conservative over-approximation of which stores (definitions) might reach which loads (uses).
- Resolves indirect calls (function pointers, virtual-like dispatch) to potential targets, enabling a provisional interprocedural CFG (ICFG).
- Translates function calls and returns into explicit COPY instructions for parameter passing and return values (a sketch of the flow-insensitive pass itself follows this list).
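AUX can be any sound flow-insensitive analysis; an inclusion-based (Andersen-style) pass is the typical choice. The sketch below shows that flavor on two constraint forms; the constraint encoding and the data are ours, invented for illustration.

```python
# Sketch of a flow-insensitive, inclusion-based (Andersen-style) pass,
# the flavor of analysis AUX performs. Encoding and data are invented.
from collections import defaultdict

addr_of = [("p", "a"), ("q", "b")]   # p = &a ; q = &b
copies  = [("p", "q")]               # p = q  (points-to(q) flows into p)

pts = defaultdict(set)               # variable -> points-to set
for v, o in addr_of:
    pts[v].add(o)

# Iterate the copy constraints to a fixpoint. There is no statement
# order: the result over-approximates every execution, which is exactly
# what AUX needs to conservatively pair stores with loads and to
# resolve indirect calls.
changed = True
while changed:
    changed = False
    for dst, src in copies:
        if not pts[src] <= pts[dst]:
            pts[dst] |= pts[src]
            changed = True

print(dict(pts))                     # {'p': {'a', 'b'}, 'q': {'b'}}
```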
1.2 Top-level SSA & partitioning
- Translate all top-level (register/local) variables into SSA form. φ-functions become multi-input COPY nodes (e.g., x₁ = φ(x₂, x₃) → x₁ = x₂ x₃).
- Group address-taken (heap or indirect memory) variables into equivalence classes called partitions, so memory locations with similar access patterns are tracked together (see the sketch after this list).
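A small sketch of both steps; grouping variables with identical access sets roughly mirrors the paper's access-equivalence idea, and all data and names here are invented for illustration.

```python
# Sketch (invented data): partition address-taken variables that are
# accessed by the same set of loads/stores, and rewrite a top-level
# phi into a multi-input COPY node.
from collections import defaultdict

# accesses[v] = ids of the LOAD/STORE instructions that may touch v,
# taken from the AUX results.
accesses = {"a": {1, 2}, "b": {1, 2}, "c": {3}}

partitions = defaultdict(list)
for var, insts in accesses.items():
    partitions[frozenset(insts)].append(var)
print(list(partitions.values()))      # [['a', 'b'], ['c']]

# x1 = φ(x2, x3) becomes the multi-input COPY x1 = x2 x3: the points-to
# set of x1 is the union of the sets of x2 and x3.
phi = ("phi", "x1", ["x2", "x3"])
copy_node = ("copy", phi[1], phi[2])
print(copy_node)                      # ('copy', 'x1', ['x2', 'x3'])
```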
1.3 Label loads and stores
- For each partition P:
  - Annotate each STORE that may modify P with a χ-function: P = χ(P).
  - Annotate each LOAD that may read from P with a μ-function: v = μ(P).
- These χ/μ annotations capture where heap definitions and uses need SSA merging (a sketch of the labeling follows this list).
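In code, the labeling step might look like this (a sketch; aux_pts, part, and the instruction tuples are all invented):

```python
# Sketch (invented data): attach mu/chi annotations to loads and stores
# using the conservative AUX points-to sets.
aux_pts = {"p": {"a"}, "q": {"a", "b"}}   # AUX points-to sets
part    = {"a": "P1", "b": "P2"}          # address-taken var -> partition

insts = [
    ("store", "p", "x"),                  # *p = x
    ("load",  "y", "q"),                  # y = *q
]

for inst in insts:
    kind = inst[0]
    ptr = inst[1] if kind == "store" else inst[2]
    touched = {part[o] for o in aux_pts[ptr]}
    if kind == "store":
        # every partition the store may modify gets P = chi(P)
        print(inst, "-> chi on", sorted(touched))
    else:
        # every partition the load may read gets v = mu(P)
        print(inst, "-> mu on", sorted(touched))
```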
1.4 Heap-SSA on partitions
Run an SSA construction over each partition, placing φ/χ/μ nodes so that heap-related definitions and uses are versioned. Now every heap memory operation is in strict SSA form, separating “before” and “after” effects.
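Heap-SSA places φ-functions for a partition the same way classic SSA does for a scalar. Below is a minimal sketch of the textbook iterated-dominance-frontier placement on a diamond CFG in which both branches store into partition P; the CFG and all names are invented.

```python
# Sketch: placing phi-functions for one heap partition with the textbook
# iterated-dominance-frontier rule. The CFG and def sites are invented.
from functools import reduce

succ = {"entry": ["then", "else"], "then": ["join"],
        "else": ["join"], "join": []}
pred = {"entry": [], "then": ["entry"], "else": ["entry"],
        "join": ["then", "else"]}
nodes = list(succ)

# Iterative dominator computation: dom[n] = set of nodes dominating n.
dom = {n: set(nodes) for n in nodes}
dom["entry"] = {"entry"}
changed = True
while changed:
    changed = False
    for n in nodes:
        if n == "entry":
            continue
        new = {n} | reduce(set.__and__, (dom[p] for p in pred[n]))
        if new != dom[n]:
            dom[n], changed = new, True

# Dominance frontier: DF(n) = {m : n dominates a pred of m, but does
# not strictly dominate m}.
df = {n: set() for n in nodes}
for m in nodes:
    for p in pred[m]:
        for n in nodes:
            if n in dom[p] and not (n in dom[m] and n != m):
                df[n].add(m)

# Both branches store into partition P, so both define a new version;
# iterating DF over the def sites puts a phi for P at the join point.
defs, phis = {"then", "else"}, set()
work = set(defs)
while work:
    n = work.pop()
    for m in df[n]:
        if m not in phis:
            phis.add(m)
            work.add(m)     # a phi is itself a definition of P
print(phis)                 # {'join'}
```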
1.5 Construct the DUG
- Nodes: one for each ALLOC, COPY, LOAD, STORE, and SSA φ/χ/μ.
- Edges:
- Unlabeled: for top-level variable dataflow (ALLOC/COPY/LOAD → uses).
- Labeled by partition P: from a STORE with χ(P) to any LOAD, STORE, or φ node that uses P.
- Unlabeled: from φ nodes of partitions to their users (a sketch of the graph encoding follows this list).
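A minimal encoding of such a graph (the class layout and names are ours, not the paper's):

```python
# Sketch of a DUG encoding; layout and names are ours.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str             # "alloc" | "copy" | "load" | "store" | "phi"
    operands: tuple = ()  # e.g. ("x", "o1") for x = &o1

@dataclass
class DUG:
    nodes: dict = field(default_factory=dict)  # node id -> Node
    edges: list = field(default_factory=list)  # (src, dst, label)

    def add_edge(self, src, dst, label=None):
        # label=None encodes top-level dataflow; label="P" encodes
        # address-taken dataflow for partition P.
        self.edges.append((src, dst, label))

    def succs(self, n, label=None):
        return [d for s, d, l in self.edges if s == n and l == label]
```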
2. Data structures for the sparse solver
- Worklist: initialized with all ALLOC nodes.
- Global points-to graph PG: holds points-to sets for top-level variables; for a top-level variable v, Ptop(v) is its set.
- Per-node IN/OUT graphs:
  - Each LOAD or φ node k has an IN_k graph storing points-to sets P_k(v) for the address-taken variables reaching it.
  - Each STORE node k has both IN_k and OUT_k graphs for the address-taken variables it may define.

All these graphs start empty. The partition function part(v) maps an address-taken variable to its partition.
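In code, this initial state might be set up as follows (a sketch; the names mirror the text above, the layout is ours):

```python
# Sketch of the solver's starting state; names follow the text above.
from collections import defaultdict, deque

PG  = defaultdict(set)   # Ptop: top-level variable -> points-to set (global)
IN  = defaultdict(lambda: defaultdict(set))   # node -> addr-taken var -> set
OUT = defaultdict(lambda: defaultdict(set))   # STORE node -> var -> set
part = {"a": "P1", "b": "P1"}                 # part(v), from step 1.2

# Only ALLOC nodes are enqueued initially; every other node is reached
# on demand as facts propagate through the DUG.
alloc_nodes = ["n1", "n2"]                    # invented ids
worklist = deque(alloc_nodes)
```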
3. Sparse flow-sensitive fixpoint iteration
while Worklist ≠ ∅:
    k = removeOne(Worklist)
    switch (kind of node k):

    case ALLOC (x = &o):
        Δ = {o} ∖ PG[x]
        PG[x] ∪= Δ
        if Δ ≠ ∅: propagate along all unlabeled outgoing edges of k

    case COPY (x = y z …):
        Δ = (PG[y] ∪ PG[z] ∪ …) ∖ PG[x]
        PG[x] ∪= Δ
        if Δ ≠ ∅: propagate along all unlabeled outgoing edges of k

    case LOAD (x = *y, annotated v = μ(P)):
        // IN_k was filled by propagation along incoming P-labeled edges
        Δ = (⋃_{o ∈ PG[y]} IN_k[o]) ∖ PG[x]
        PG[x] ∪= Δ
        if Δ ≠ ∅: propagate along all unlabeled outgoing edges of k   // x is top-level

    case STORE (*x = y, annotated P = χ(P)):
        if PG[x] = {o} for a single concrete location o:   // strong update
            OUT_k = (IN_k with o's bindings killed) ∪ {o ↦ PG[y]}
        else:                                              // weak update
            OUT_k = IN_k ∪ {o ↦ PG[y] : o ∈ PG[x]}
        Δ = bindings newly added to OUT_k
        if Δ ≠ ∅: propagate Δ along all outgoing edges of k labeled P

    case φ (merging partition P):
        Δ = (union of incoming IN/OUT bindings for P) ∖ IN_k
        IN_k ∪= Δ
        if Δ ≠ ∅: propagate Δ along all outgoing edges of k   // unlabeled; P is implicit
// propagation: if target's graph grows, re-add that node to Worklist
Propagation means: for each successor of k along the relevant DUG edges, deliver Δ. Over an unlabeled edge the new top-level facts are already visible in the global PG, so the successor is simply re-added to the worklist; over an edge labeled P, the Δ bindings for partition P are merged into the successor's IN graph, and the successor is re-added only if that graph actually grew.
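Putting the rules together, here is a compact, runnable sketch of the solve on a four-node DUG (p = &a; q = &b; *p = q; r = *p). It is our reconstruction of the loop above, not the paper's code; the φ case is omitted because this tiny graph has no merges, and a strong update fires when the store's target is a single concrete location.

```python
# Runnable sketch of the sparse flow-sensitive solve on a four-node DUG:
#   n1: p = &a    n2: q = &b    n3: *p = q (chi on P)    n4: r = *p (mu on P)
# Our reconstruction of the rules above, not the paper's code.
from collections import defaultdict, deque

nodes = {
    "n1": ("alloc", "p", "a"),
    "n2": ("alloc", "q", "b"),
    "n3": ("store", "p", "q"),
    "n4": ("load",  "r", "p"),
}
edges = [("n1", "n3", None), ("n1", "n4", None),  # top-level uses of p
         ("n2", "n3", None),                      # top-level use of q
         ("n3", "n4", "P")]                       # heap def-use in partition P

PG  = defaultdict(set)                       # top-level points-to sets
IN  = defaultdict(lambda: defaultdict(set))  # node -> object -> points-to set
OUT = defaultdict(lambda: defaultdict(set))
singletons = {"a", "b"}                      # single concrete locations

def propagate(k, delta, label, wl):
    for s, d, l in edges:
        if s != k or l != label:
            continue
        if label is None:        # PG is global: the successor just re-runs
            wl.append(d)
        else:                    # merge new bindings into the successor's IN
            grew = False
            for o, pts in delta.items():
                if not pts <= IN[d][o]:
                    IN[d][o] |= pts
                    grew = True
            if grew:
                wl.append(d)

wl = deque(k for k, n in nodes.items() if n[0] == "alloc")
while wl:
    k = wl.popleft()
    kind, a, b = nodes[k]
    if kind == "alloc":                      # a = &b
        if b not in PG[a]:
            PG[a].add(b)
            propagate(k, None, None, wl)
    elif kind == "store":                    # *a = b, with P = chi(P)
        out = {o: set(pts) for o, pts in IN[k].items()}
        if len(PG[a]) == 1 and next(iter(PG[a])) in singletons:
            (o,) = PG[a]
            out[o] = set(PG[b])              # strong update: kill old bindings
        else:
            for o in PG[a]:                  # weak update: accumulate
                out.setdefault(o, set()).update(PG[b])
        # OUT only grows in this monotone sketch; Δ is the new bindings.
        delta = {o: pts - OUT[k][o] for o, pts in out.items() if pts - OUT[k][o]}
        if delta:
            for o, pts in delta.items():
                OUT[k][o] |= pts
            propagate(k, delta, "P", wl)
    elif kind == "load":                     # a = *b, with mu(P)
        new = set().union(*(IN[k][o] for o in PG[b])) if PG[b] else set()
        if not new <= PG[a]:
            PG[a] |= new
            propagate(k, None, None, wl)

print(dict(PG))   # expect r -> {'b'}: the store's effect reaches the load
```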
Why this is flow-sensitive and sparse
Flow-sensitive: The DUG edges, combined with SSA versioning, encode the causal order—uses pull from the precise definitions in effect at that program point. Updates don’t retroactively affect earlier uses because each definition created a new version.
Sparse: The algorithm never walks the full control-flow graph. It only propagates along the tiny DUG containing pointer-relevant operations, so irrelevant statements are skipped entirely.
This staging—one cheap flow-insensitive pass to bootstrap, then one precise flow-sensitive solve on a compressed graph—is what gives scalability to millions of lines of code.
Comparison / notes
Chase et al. proposed dynamically maintaining SSA during flow-sensitive pointer analysis; their idea was not evaluated directly, but related approaches (e.g., Tok et al.) hit scalability limits when updating SSA on the fly. Hardekopf & Lin instead compute SSA up front (guided by AUX) and then solve sparsely, avoiding repeated SSA reconstruction and yielding practical performance on large programs.
Mathematical formulation (intuition)
Interpretation: starting from conservative seeds (from AUX), propagate along the sparse edges; because the transfer functions are monotone over the finite lattice of points-to sets, the iteration converges to a least fixed point.
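Written out as a sketch (our notation, not the paper's): let P map variables to points-to sets and F be the joint transfer function induced by the node rules above.

```latex
% Sketch in our notation. P_0 holds the ALLOC seeds; F is the joint
% transfer function induced by the ALLOC/COPY/LOAD/STORE/phi rules.
\[
  P_0 = \text{ALLOC seeds}, \qquad P_{i+1} = P_i \sqcup F(P_i)
\]
\[
  F \text{ monotone on the finite lattice } (\mathcal{V} \to 2^{\mathcal{O}},\ \subseteq)
  \;\Longrightarrow\;
  \exists\, n:\ P_{n+1} = P_n = \operatorname{lfp}\bigl(\lambda P.\, P_0 \sqcup F(P)\bigr)
\]
```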