High level overview of Symbolic Execution.
Introduction
Symbolic Execution
, a program analysis technique first discussed in 1976, has gained traction among security researchers in the past 5 years.
As we have discussed in the previous post, this is mostly used to aid fuzzers towards a certain goal, usually code coverage.
In practise, Dynamic Symbolic Execution, also known as Concolic Execution
is more prevalent.
This post serves to discuss the high level details of Symbolic/Concolic Execution without going into implementation details, which will be saved for the next post.
Basic Symbolic Execution Theory
What are symbolic values?
Symbolic values are unique values that each represents a set of concrete values(e.g the set of 32-bit integers).
By supplying a symbolic value as input, each value can generate the results equivalent to all the concrete values in that set.
Upon reaching branching statements, we can form an equation with the symbolic values.
For example:
1 | input integers x, y |
Instead of using concrete values, we can use symbols x0
and y0
as input to variables x and y
Then, let the value of t after the if statament be t0
.
Up to line 9, t0
can have 2 values:
1 | t0 = x0 if x0 > y0 |
x0 > y0
and x0 <= y0
are called path conditions.
A path condition must be satisfied for that path to be taken.
In line 9, two more path conditions are collected for two more paths.
1 | t0 < x0 |
To trigger a crash, we need the first path condition to be true.
Now we have 3 equations.
1 | t0 = x0 if x0 > y0 |
We can feed these equations into a Satisfiability Modulo Theories(SMT) solver to give us the concrete inputs that satisfy this relationship.
These concrete inputs will trigger a crash.
SMT Solver Theory
Given a logical formula, the SMT solver will either produce satisfying assignments(concrete values) or tell you it’s unsatisfiable.
In practise, it might also tell you it does not know how to solve it.
The SMT solver works by having a database of boolean constraints based on your logical formula in Conjunctive Normal Form(CNF) clause.
For example, the conditions x>5 AND y<5 AND (y>x OR y>2)
can be turned into boolean constraint x1 AND x2 AND (x3 OR x4)
.
The solver assumes a state for an unknown at random, then goes through the constraints to propagate that state’s implication on other unknowns.
For example, if it meets a constraint (x3 OR x4)
and x3 is FALSE, we know x4 must be TRUE.
If all constraints are satisfiable, amazing. Repeat the process and assume a state for another unknown.
When it reaches an unsatisfiable constraint due to the assumed states, it analyses all current states and adds a new conflict clause to the database so this unsatisfiable situation does not happen again.
For example as shown above, x3
is FALSE and x4
is TRUE. Another constraint (x3 OR -x7)
is present, and since x3
is FALSE, x7
must be FALSE also.
Now imagine in the next round of assumption, we come across the constraint (x6 OR x7)
and we assume x6
to be FALSE, so x7
must be TRUE
.
Now we have a problem, because x7
will be both FALSE and TRUE, which is against the rules of boolean algebra.
In this case, a constraint clause like (x3 AND x6)
will be added, to make sure such a situation is avoided.
This process is repeated until all constraints are solved, or a halt condition is hit.
This is the Boolean SAT solver part of the SMT solver.
If it successfully finds states to solve for all the boolean constraints, it will try to generate concrete values that matches the states.
For example, if state x1
is TRUE, a value of 6 might be generated since 6 > 5 is TRUE.
This is the theory solver part.
Challenges of Symbolic Execution
Path Explosion
In the small example above, we tried to collect all the path conditions and only evaluate satisfiability at the end.
This is not going to work in real life code. Since the number of paths increase exponentially with number of branches, this approach will lead to “path explosion”, where there are simply too many paths to keep track of and evaluate.
Instead, we can invoke the SMT solver to check constraints and try to eliminate infeasible paths at every branch.
A more refined solution is Dynamic Path Pruning
(see references section for paper), which can be used to mitigate path explosion.
Complex Libraries
Although we can mitigate path explosion to a certain extent, real life symbolic execution is still extremely infeasible when complex library functions come into the equation.
Just imagine the ten thousand lines of code in a modern malloc
implementation. Symbolically executing through that everytime is painful and will get nowhere.
Researchers have proposed to use Abstract Modelling
to replace library functions with a simplified model.
For example, the malloc
function might be simplified to returning a pointer in a block of memory and incrementing the pointer.
Although rudimentary, the effects will be similar on a small scale if free
is not taken into account.
Unsatisfiable Constraints
For paths that are long and involves many conditions, SMT solvers may not be able to solve and find satisfying assignments.
Other paths may be short but involves computations that are outside the capabilities of the solver, such as non-linear arithmetic or cryptographic functions.
In this case, Concolic Execution
can be useful.
Concolic Execution
Concolic testing overcomes the problem mentioned above by combining concrete execution (i.e. testing) with
symbolic execution.
Symbolic execution is used to solve for inputs that lead along a certain path.
However, when a part of the path condition is infeasible for the SMT solver to handle, we substitute values from a test run of the program.
In many cases, this allows us to make progress towards covering parts of the code that we could not reach through either symbolic execution or randomly generated tests.
Using the snippet below as example:
1 | int foo(int v) { |
If we use symbols x0
and y0
as input:
Up to the first statement, symbolic execution gives usz = y0 * y0 % 50
The path condition will be x0 = y0 * y0 % 50
This formula is not linear in the input y0
, and so it may defeat the SMT solver and we cannot move forward.
We can address the issue by treating foo
, the function that includes nonlinear computation, concretely instead of symbolically.
We use concrete random value x=100
, y=10
as input.
Now we get z = foo(y)
, which is z = 0
.
One path condition is x0==z
, which is x0==0
.
The other path condition is x0!=0
Now we solve the two paths one by one, starting with the first which leads us into the if statement
It’s easy to solve to get x0 == 0
We replace our initial x=100 with x=0
in order to continue.
Now at the next branch, again we have two path conditions:
The first is x0==foo(y0) AND x0 > y0+10
and the inverse of that is x0==foo(y0) AND x0 <= y0+10
.
We try to solve for the first path again.
Since the first part involes an unwanted function, we will symbolically solve the second part, enumuerating all possible sets of x0
and y0
.
Then, we will concretely test the condition by feeding all generated values of x0
and y0
into foo
until we find a match.
We find x0=16
and y0=4
, and move on the path.
This input leads to ERROR, and we’ve found the bug.
One con about this method is that it’s difficult to know when to stop analysing, because we do not try to obtain any reasoning regarding the foo
function.
If the foo function is just some random number generator that has no logic, we will be stuck generating an infinite sequence of x0
and y0
trying to satisfy it.
Of course, all of these analyses work at either the binary level(like SAGE) or intermediate representation(IR) level(like KLEE).
Pros of IR:
- Easy to implement with small number of high level instructions
- Architecture agnostic
Cons of IR:
- Slower execution
Pros of binary code:
- Faster execution
Cons of binary code:
- Architecture dependent with huge amount of instructions to emulate.
Conclusion
Here are some real life(production) concolic execution tools:
- KLEE
- SAGE(close source)
- Triton
- angr
In essence, concolic/symbolic execution are theoretically feasible, but often faced with scalability issues such as path explosion and unsolvable constraints when implemented against real production softwares.
As stated in the SAGE paper, Blackbox(regular fuzzing) is simple, lightweight, easy, and fast but can yield limited code coverage.
Whitebox(symbolic execution aided fuzzing) is smarter but more complex.
Apart from finding bugs, symbolic executors are also able to triage crashes(perhaps found through traditional fuzzing).
I will research more into this and maybe write it up as a future blog post.
Designing a production level symbolic executor require an insane amount of optimization, as well as binary instructions modelling and linking up with libraries to perform code coverage analysis, solve SMT and many more.
As such, there aren’t many mature tools that are used by independent researchers to find bugs.
Most prototypes from research papers are also hardly scalable, and the research around symbpolic execution seems to be in all directions.
Building one from scratch is extremely complex and costly, so I will not venture there until I get a chance to maybe build one as part of research when I enter university.
A question I still have is, since the goal of Symbolic Execution is to increase code coverage and explore all paths of a program, how is it able to find subtle bugs such as double free/overflows without running a concrete input?
To answer this question, my next post should be a half reading/half lab post regarding KLEE
.
EOF
References
- https://madhu.cs.illinois.edu/cs598-fall10/king76symbolicexecution.pdf
- https://www.youtube.com/watch?v=mffhPgsl8Ws
- https://www.cs.cmu.edu/~aldrich/courses/17-355-18sp/notes/notes14-symbolic-execution.pdf
- https://www.cs.cmu.edu/~aldrich/courses/17-355-19sp/notes/notes15-concolic-testing.pdf
- https://github.com/Resery/-Symbolic-Execution-For-Bug-Hunting-in-Binaries
- https://www.csie.ntu.edu.tw/~hchsiao/pub/2018_IEEE_DSC.pdf
- https://queue.acm.org/detail.cfm?id=2094081