Call-Graph Construction
Problem definition
A call graph is an artifact produced by program analysis tools to record the relationships between a function and the functions it calls. Building a callgraph, like many static analyses, is known to be an undecidable problem, meaning there is no perfect solution to build a perfect static callgraph. It relies on tricks and techniques to approximate the best callgraph possible.
Numerous techniques exist to improve callgraph generation at compile time, such as Reach Analysis, Class Hierarchy Analysis, Rapid Type Analysis, etc. Most of these techniques are useful in an Object Oriented program paradigm; since this tool focuses on Rust, the solution differs a bit. (Maybe in the future support for OOP languages will be added.)
Call-graph analysis on Rust’s LLVM IR is easier because:
- Generics are monomorphised → almost every call is a direct
call @func_name
, not a virtual or template dispatch. - No class inheritance tree → you don’t need CHA or RTA; there’s no subtype hierarchy to walk.
- Only two kinds of indirect calls:
- Trait objects (dyn Trait) lowered to v-tables we can parse
- Function pointers tracked by simple points-to or metadata
Monomorphization is the process where the compiler takes a generic function or type (one written to work over any type T
) and, at compile time, generates a separate, concrete version for each actual type it’s used with.
We solve both in our fixed-point loop by:
- Seeding the points-to sets with:
- Every function pointer stored in a
@vtable.* global
(trait-object slots). - Every function whose address is “taken” and stored in some variable (raw function pointers).
- Every function pointer stored in a
- Propagating those seeds along all pointer‑flow edges (loads, stores, bitcasts, φ-nodes, GEPs).
- Resolving each indirect
call %p(...)
by looking up the final points-to setpts[%p]
and adding an edge to every function in it.
Because monomorphized generic calls show up as call @foo_u32
or call @foo_String
(i.e. direct getCalledFunction()
hits), the algorithm does nothing special for generics: they already appear as direct edges in step 1 of each iteration.
Key insight: we don’t just scan for direct calls—we first run a flow-insensitive points-to analysis over the entire IR to conservatively resolve all function-pointer targets. That initial points-to graph seeds our indirect‑call edges and lets us iterate to a much more precise call graph.
Plain text algorithm
We repeat these steps until we add no new edges:
-
Walk every basic block (using FI points-to info):
Prerequisite: have initial points-to sets for all function‑pointer variables, computed by a fast flow-insensitive pass.
- On a direct call, add an edge:
caller → callee
. - On an indirect call, look up the variable’s current FI points-to set (
PFI[p]
) and add a provisional edge to each potential target; remember this call-site for later refinement. - On any pointer operation (store, load, bitcast, GEP, φ-node, return, parameter), record that the source pointer can flow to the destination pointer.
- On a direct call, add an edge:
-
Pointer-tracking phase:
- Seed: Initialize each pointer’s set with its FI points-to results (all v-table slots & address-taken functions).
- Flood: Propagate those seeds along pointer‑flow edges to refine the points-to graph before each call-resolution pass.
-
Resolve indirect calls:
Using the refined points-to graph from Step 2…
- For each remembered indirect call site, look up its variable’s final function set.
- Build our graph by adding edges from the caller to each function in that set.
- If any new edge was added, repeat from step 1.
Pseudo-code
repeat until we add no new edges
1. walk every basic block
• direct call → add caller → callee edge
• indirect call → remember call-site + operand
• pointer copy → remember edge src ↦ dst
(store, load, bitcast, getelementptr, φ, return, param)
2. pointer-tracking phase
• seed: every v-table slot + address-taken function
• work-list flood:
while some P(v) just grew
for (v → w) in copy-edges push w
P(w) ← P(w) ∪ (new funcs)
3. resolve remembered indirect calls
• for each call-site c
for t in P(op(c)) add caller(c) → t
if any edge new repeat outer loop
Mathematical Formulation
Here, the least-fixed-point computation of P*
is exactly our FI points-to pass—its results feed directly into CG
to resolve indirect calls.
Notation:
- CG — The call graph—the set of caller→callee edges
- = — “equals” or “is defined as”
- { … } — A set
- ( , ) — An ordered pair or tuple
- caller — A function mapping each call‑site
c
to the function that contains it - c — The call‑site being considered
- t — A variable ranging over possible callee functions
- ∣ (“such that”) — Introduces the conditions that select which pairs belong to the set
- c ∈ C — The call‑site
c
is one element of the setC
of all call‑sites - , — “and” (both conditions must hold)
- t ∈ … — The target
t
must come from the following case‑analysis result - {…} — A two‑branch definition depending on whether
c
is direct or indirect - {callee(c)} — If
c
is direct, the singleton set containing the statically resolved callee - P* — The least fixed point of the points‑to mapping
P
- op(c) — The SSA value (pointer) used by the indirect call at call‑site
c
- P*(op(c)) — “All functions that the pointer
op(c)
may refer to”