Feature review of AFL and WinAFL
Introduction
Since we’ve already talked about the types of fuzzing quite some time ago (Fuzzing 1), let’s dive straight into an overview of American Fuzzy Lop (AFL), a coverage-guided evolutionary fuzzer.
In this post, we will go over the design inspiration behind AFL, its inner workings, as well as WinAFL, a fork by Project Zero for fuzzing Windows binaries.
Inspiration
Before coverage-guided evolutionary fuzzers were a thing in the late 2000s, researcher Tavis Ormandy (forgive me for making this sound like a history lesson… though it would be disrespectful to leave out the name ;)) used gcov to measure the coverage of various testcases, find the ones with the most coverage, and feed those as a corpus to a dumb fuzzer.
This netted him quite a number of bugs, and sparked the thought: “what if we performed this at regular intervals during the fuzzing session, so that the generated test cases always favoured coverage?”
However, it soon became clear that explicitly calculating coverage this way is resource intensive, making the approach look good on paper but infeasible in practice.
AFL’s main design principle then became “simplicity and usability”. For example, it simply favours testcases that do something new, instead of explicitly reasoning about coverage.
Quoting the README verbatim:

“In fact, at its core, it's designed to be just a very good traditional fuzzer with a wide range of interesting, well-researched strategies to go by. The fancy parts just help it focus the effort in places where it matters the most.”
AFL Details
Coverage
Just like in symbolic execution, coverage is one of, if not the most, important feedback metrics on the journey of program exploration.
AFL collects coverage using a 64 kB shared memory buffer. This size is chosen for efficient L2 cache access and a decently low collision rate for most programs.
The pseudocode, as given in the AFL whitepaper, is:

cur_location = <COMPILE_TIME_RANDOM>;
shared_mem[cur_location ^ prev_location]++;
prev_location = cur_location >> 1;

and is injected at every branch point in the compiled source code, made possible by AFL’s instrumenting wrappers around compilers like gcc.
Every byte set in the output map can be thought of as a hit for a particular (branch_src, branch_dst) tuple in the instrumented code.
Why is prev_location divided by 2 (shifted right by one)? This is to differentiate between the paths of tight loops. For example, a loop around branch A, producing the path A->A, would be recorded as shared_mem[A ^ A] = shared_mem[0], which is identical to that of path B->B if the division did not exist.
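The update can be sketched in C (map size and shift as described in the whitepaper; the function and variable names here are illustrative, not AFL’s actual symbols):

```c
#include <stdint.h>

#define MAP_SIZE (1 << 16)            /* the 64 kB coverage bitmap */

static uint8_t  shared_mem[MAP_SIZE];
static uint16_t prev_location;

/* Executed at every instrumented branch; cur is that block's
   compile-time random ID. */
static void afl_maybe_log(uint16_t cur) {
    shared_mem[cur ^ prev_location]++;
    prev_location = cur >> 1;         /* the "divide by 2" */
}
```

With the shift, a tight loop on block A updates cell A ^ (A >> 1) rather than A ^ A == 0, so the traces A->A and B->B no longer collide in cell 0.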
When a mutated input produces an execution trace containing new tuples, the corresponding input file is favoured, preserved and routed for additional processing.
If no new tuples are found, the input is discarded, even if the overall flow is unique.
For example, given that A->B->C and C->A->B are already favoured, A->B->E will be preserved (the tuple B->E is new), while A->B->C->A will be discarded, even though its overall flow is unique.
The fuzzer also considers coarse tuple hit counts, divided into several buckets:
1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+
Changes within the range of a single bucket are ignored (for example, a loop cycling 40 times vs cycling 43 times); a transition from one bucket to another is flagged as an interesting change in program control flow (a piece of code that ran once now runs twice, amazing!), and is favoured for evolution.
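AFL implements this classification with a 256-entry lookup table applied to the whole map; a hypothetical but behaviour-equivalent function version:

```c
#include <stdint.h>

/* Map a raw hit count to its coarse bucket, mirroring AFL's
   1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+ classes. Each bucket gets
   a distinct bit, so a bucket transition shows up as a new bit set. */
static uint8_t bucket(uint8_t hits) {
    if (hits == 0)   return 0;
    if (hits == 1)   return 1;
    if (hits == 2)   return 2;
    if (hits == 3)   return 4;
    if (hits <= 7)   return 8;
    if (hits <= 15)  return 16;
    if (hits <= 31)  return 32;
    if (hits <= 127) return 64;
    return 128;
}
```

A loop cycling 40 vs 43 times both land in the 32-127 bucket, so the difference is invisible; 1 vs 2 executions land in different buckets and get flagged.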
Evolution
These favoured test cases are added back into the input queue, and used as a corpus for next rounds of fuzzing.
AFL does not prioritise these newly added entries over the current queue; it simply appends them behind the existing ones. This design has empirically proven more effective.
If you imagine it in action, it’s similar to an ever-shrinking circle, instead of the directed “spear” poking everywhere that we would get if we prioritised the new mutations.
Among the queued test cases, AFL periodically re-evaluates the queue using a fast algorithm that selects a smaller subset of test cases which still covers every tuple seen so far. It works by looping through the tuples: for each tuple not yet covered by the subset, it finds the candidate testcases that cover it, selects the smallest (file size) and fastest (execution latency) one, then registers all tuples that this testcase exercises into the subset. It repeats this until every tuple has been considered.
The generated corpus of “favored” entries is usually 5-10x smaller than the starting data set. Non-favored entries are not discarded, but they are skipped with varying probabilities when encountered in the queue.
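The loop above is essentially a greedy set cover. A toy sketch (AFL’s real cull_queue() works on the shared bitmap and a top_rated[] array; the struct and score field here are hypothetical stand-ins):

```c
#include <stdint.h>
#include <string.h>

#define MAP_SIZE 64            /* shrunk map for the sketch */

/* Hypothetical per-testcase record: which tuples it exercises, plus a
   score combining file size and execution latency (smaller is better). */
struct tc { uint8_t tuples[MAP_SIZE]; unsigned score; int favored; };

/* Greedy cull: for every tuple not yet covered by the favored subset,
   pick the best-scoring case that hits it, then register every tuple
   that case covers. */
static void cull(struct tc *q, int n) {
    uint8_t covered[MAP_SIZE] = {0};
    for (int t = 0; t < MAP_SIZE; t++) {
        if (covered[t]) continue;
        int best = -1;
        for (int i = 0; i < n; i++)
            if (q[i].tuples[t] && (best < 0 || q[i].score < q[best].score))
                best = i;
        if (best < 0) continue;              /* no case hits this tuple */
        q[best].favored = 1;
        for (int t2 = 0; t2 < MAP_SIZE; t2++)
            if (q[best].tuples[t2]) covered[t2] = 1;
    }
}
```

Note that a large, slow testcase whose tuples are all covered by smaller cases ends up non-favored, which is exactly the 5-10x shrinkage described above.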
- If there are new, yet-to-be-fuzzed favorites present in the queue, 99% of non-favored entries will be skipped to get to the favored ones.
- If there are no new favorites:
  - If the current non-favored entry was fuzzed before, it will be skipped 95% of the time.
  - If it hasn't gone through any fuzzing rounds yet, the odds of skipping drop down to 75%.
A snippet:
if (pending_favored) {

  /* If we have any favored, non-fuzzed new arrivals in the queue,
     possibly skip to them at the expense of already-fuzzed or non-favored
     cases. */

  if ((queue_cur->was_fuzzed || !queue_cur->favored) &&
      UR(100) < SKIP_TO_NEW_PROB) return 1;

}
This adds some brains to our shrinking circle, removing redundancy and prioritising certain parts for efficiency. Since no testcase is actually discarded, accuracy is not sacrificed in the long run.
It also employs a minimization algorithm to actively try and reduce the size of the input testcase.
The actual minimization algorithm is:

1) Attempt to zero large blocks of data with large stepovers,
2) Perform a block deletion pass with decreasing block sizes and stepovers, binary-search-style,
3) Attempt alphabet normalization by counting unique characters and trying to bulk-replace each of them with a zero value,
4) As a last step, perform byte-by-byte normalization on non-zero bytes.
A snippet of it:
/* Select initial chunk len, starting with large steps. */

len_p2 = next_p2(q->len);

remove_len = MAX(len_p2 / TRIM_START_STEPS, TRIM_MIN_BYTES);
This algorithm is by no means elegant on paper, but it’s definitely effective and has low overhead. For example, a buffer-overflow testcase that was initially abcdefghijklmnopqrs can be minimized to 0000000000000000000, clearly revealing that the specific characters are not required to trigger the path.
Mutation
The original AFL engine consists of 2 stages of mutation.
The first stage is highly deterministic, consisting of sequential bit flips, byte flips, addition/subtraction of small integers, and insertion of interesting values (such as -1, max integer, etc.). When changes to a certain region of the testcase do not trigger any new tuples, that region may be excluded from the following deterministic runs for efficiency.
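For illustration, the walking-bitflip primitive and a few of the “interesting” constants look roughly like this (afl-fuzz.c uses an equivalent FLIP_BIT macro; the value list here is abbreviated):

```c
#include <stdint.h>
#include <stddef.h>

/* Flip a single bit of the testcase buffer. Calling it twice with the
   same position restores the original byte, which is how the walking
   bitflip stage undoes each mutation after running the target. */
static void flip_bit(uint8_t *buf, size_t pos) {
    buf[pos >> 3] ^= 128 >> (pos & 7);
}

/* A few "interesting" 32-bit values tried at each position
   (abbreviated; AFL's full list also includes powers-of-two edges). */
static const int32_t interesting_32[] = {
    -1, INT32_MIN, INT32_MAX, 0, 1, 65535, 65536
};
```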
/* We also use this stage to pull off a simple trick: we identify
   bytes that seem to have no effect on the current execution path
   even when fully flipped - and we skip them during more expensive
   deterministic stages, such as arithmetics or known ints. */
The reason for this stage is to keep the difference between the initial corpus and the crashing input small, so it’s easier for the user to reason about the final crashing testcase.
The second stage consists of stacked, random applications of the operations mentioned above, as well as splicing of two testcases.
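Splicing can be sketched as a simple crossover (a toy version; AFL picks the split point at random within a region where the two inputs differ, and the function name here is illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Combine the head of testcase a with the tail of testcase b at a
   crossover point, producing a new candidate input in out.
   Returns the output length, or 0 if the split point is invalid. */
static size_t splice_cases(const uint8_t *a, size_t a_len,
                           const uint8_t *b, size_t b_len,
                           size_t split, uint8_t *out) {
    if (split > a_len || split > b_len) return 0;
    memcpy(out, a, split);                      /* head from a */
    memcpy(out + split, b + split, b_len - split); /* tail from b */
    return b_len;
}
```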
Dictionary
AFL supports the use of dictionaries to aid in fuzzing.
An example PNG dictionary looks like:
header_png="\x89PNG\x0d\x0a\x1a\x0a"
These specified keywords will be used by AFL when mutating testcases, in order to formulate correct header bytes efficiently and pass format checks.
if (!extras_cnt) goto skip_user_extras;
This is quite necessary when dealing with parsers that enforce strict checksums, unless you want AFL to spend weeks discovering the correct format. As we will see in the next post, AFL++ improves on this with REDQUEEN, which efficiently overcomes checksums and magic bytes without resorting to symbolic execution.
Apart from allowing the user to supply a dictionary of keywords for the fuzzer to use, AFL attempts to build a dictionary automatically, by collecting parts of testcases that behave like atomic “keywords” during mutation.
This blogpost by the creator of AFL explains it very well, so I’m just going to quote him.
https://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html
To explain the approach, it's useful to rely on the example of a PNG file. The PNG format uses four-byte, human-readable magic values to indicate the beginning of a section, say: …
Duplication Removal
AFL considers a crash unique if one of the following two conditions is met:

- The crash trace includes a tuple not seen in any of the previous crashes,
- The crash trace is missing a tuple that was always present in earlier faults.
A -C flag passed to AFL enables crash exploration mode. AFL mutates these crashes and attempts to find results that are clearly controllable from the input. For example, a segfault at an address corresponding to the testcase’s data probably signifies an arbitrary jump, which is surely something to note.
if (fault == crash_mode) {
This however adds potentially unnecessary workload to the queue, so I don’t recommend using it. Exploration can be done separately after the crash cases are reported.
Forkserver and Persistent Mode
AFL supports “forkserver mode” to improve efficiency.
Essentially, AFL does nothing until the loader has done all its work and the program is at its entry point. Then it forks the current image and runs each fuzz iteration in a forked process, so a single round of loading suffices for the entire fuzzing session, skipping a lot of libc linking overhead.
It works by injecting a shim into the program being fuzzed, right before the program’s entry point; the shim opens a pipe to communicate with AFL. Once a fork command is received, it forks and resumes execution of the binary in the child process, while the parent monitors it.
__afl_fork_wait_loop:
You may ask: how are the fuzz payloads delivered? Via a file (since forked processes share file descriptors). Each newly forked process rewinds the fd, so it reads the freshly delivered payload.
/* Isolate the process and configure standard descriptors. If out_file is
   specified, stdin is /dev/null; otherwise, out_fd is cloned instead. */
With fast targets, the fork server can offer considerable performance gains, usually between 1.5x and 2x.
Persistent mode is just forkserver mode, except that each forked process is fed multiple testcases before it is terminated.
Binary Only Mode
In cases where source code is not available, AFL provides a binary only mode.
qemu_mode ? "qemu " : "", dumb_mode ? " dumb " : "",
which… quite obviously has significant overheads.
Instead of instrumenting native code, AFL uses QEMU’s “user emulation” mode to run the binary, and instruments the code at QEMU’s intermediate representation level.
The QEMU process is initialized once and then forked, to avoid the (quite massive) startup overhead. Another cost is QEMU’s translation of new, never-before-seen blocks; AFL adds the translated code to a cache shared with future QEMU clones to avoid redundant translation.
As a result of these two optimizations, the overhead of the QEMU mode is roughly 2-5x.
WinAFL
WinAFL is a fork of AFL by Project Zero. The original AFL project was quite *nix-centric and can’t be used to test Windows binaries; WinAFL modifies and extends AFL to work on Windows.
Persistent Mode
WinAFL does not support forkserver mode, since Windows does not natively provide a fork() equivalent.
Instead, it relies heavily on persistent mode for speed. Heavy initialization code should be performed in the main function of the harness, while a separate fuzz function is created to invoke the target. This fuzz function’s offset is fed to WinAFL, which re-executes the function a user-specified number of times.
WinAFL is designed to run many different testcases (from tens up to 5000, user-specified) in a single process in order to match the efficiency of AFL’s forkserver mode.
That also means the harness needs to be clean and accurate: not too much initialization and cleanup inside the fuzz function, yet the target’s state must stay generally clean across iterations so it does not interfere with the next payload.
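A hypothetical harness following that shape might look like this (parser_global_init, fuzz_one_input, and the parsing call are stand-ins for your target; in practice the function’s address is handed to WinAFL via its -target_offset or -target_method options):

```c
#include <stdio.h>
#include <stddef.h>

static int initialized;

/* Heavy, one-time setup: runs once in main, outside the fuzzed loop. */
static void parser_global_init(void) { initialized = 1; }

/* The function WinAFL re-executes per testcase. It must fully consume
   the input each time and leave no lingering state behind. */
int fuzz_one_input(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    char buf[4096];
    size_t n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    /* target_parse(buf, n);  -- call the code under test here */
    return (int)n;
}
```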
Shared Memory Payload Delivery
WinAFL adds the ability to deliver the fuzzing payload to the target program through shared memory, instead of through a file as we’ve discussed before with AFL. The layout is simple: the first 4 bytes are the sample size, and the following bytes are the payload data.
The shared memory’s name is passed to our target program to replace the initial file name arguments.
if (!use_sample_shared_memory) {
This requires us to write a harness that explicitly reads the data from shared memory. The name of the shared memory mapping is passed in via the AFL command line, replacing the filename argument @@.
Shared memory is faster and avoids some problems with files (e.g. being unable to overwrite the sample file because the target maintains a lock on it). It also reduces stress on the disk.
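A portable sketch of parsing that size-prefixed layout on the harness side (the helper name is hypothetical; on Windows the buffer itself would come from OpenFileMapping/MapViewOfFile on the name WinAFL passes in):

```c
#include <stdint.h>
#include <string.h>

/* Parse WinAFL's shared-memory sample layout: a 4-byte length followed
   by the payload bytes. Returns the payload length, or 0 on a malformed
   buffer. `max` is the size of the mapping.
   Note: memcpy of the raw prefix assumes a little-endian host. */
static uint32_t read_sample(const uint8_t *shm, uint32_t max,
                            const uint8_t **payload) {
    if (max < 4) return 0;
    uint32_t len;
    memcpy(&len, shm, 4);
    if (len > max - 4) return 0;      /* length must fit in the mapping */
    *payload = shm + 4;
    return len;
}
```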
DynamoRIO Blackbox
WinAFL uses DynamoRIO to perform dynamic instrumentation (similar to AFL’s QEMU mode).
DynamoRIO takes care of the persistent-mode mechanics mentioned above, as well as logging coverage.
This approach has been found to introduce an overhead of about 2x compared to native execution speed, thus comparable to the original AFL in binary instrumentation mode.
Another DynamoRIO feature is fuzzing an already running process by attaching to it, although I don’t see much point in that.
Syzygy Instrumentation
While AFL supports instrumenting source code via its custom gcc wrapper, WinAFL doesn’t really support fuzzing straight from source code.
It uses syzygy along with full PDB symbols to instrument an already-compiled binary using a technique called binary rewriting: the compiled binary is taken apart, analysed, “transformed” by inserting coverage code snippets, and finally pieced back together.
Syzygy is no longer supported and only works on 32-bit binaries, so it’s quite useless, to be frank.
However, it does allow for a 2-4x speed increase compared to DynamoRIO.
Intel PT Blackbox
Instead of instrumenting the code to collect coverage, WinAFL’s Intel PT mode uses the CPU’s Intel Processor Trace capability to retrieve a trace of the executed instructions.
The trace is compressed, written to a buffer of user-specified size, and decoded by WinAFL to obtain coverage.
This is usually slower (https://github.com/googleprojectzero/winafl/issues/231) than DynamoRIO, depending on the size of the trace buffer specified, and requires at least Windows 10 v1809 and a relatively recent Intel CPU. It also does not work against self-modifying code.
One perk is that no modification/instrumentation of the code is required, so it might work in cases where neither of the other options does.
Custom Functionalities
WinAFL is quite customizable (much more so than AFL) and allows users to write their own processing DLLs.
Examples include custom mutators, network fuzzing capabilities and more.
The snippet below is part of a “server” DLL used to fuzz network clients.
CUSTOM_SERVER_API int APIENTRY dll_run(char *data, long size, int fuzz_iterations) {
By providing custom implementations of the dll_init and dll_run functions, WinAFL can be transformed into other types of fuzzers.
Conclusion
AFL gives off the vibes of a true hacker tool: a bunch of simple yet well-engineered hacks stuck together, and extremely practical. Every idea and feature is implemented so that it nudges toward the final objective without forcing it (unlike certain academic tools that pride themselves on high accuracy but take forever on production software).
I believe AFL dominated the security space during its actively maintained years. Although probably still very useful now, targets are getting more hardened and new practical research pops up daily. A community-maintained fork of AFL, named AFL++, emerged around 2020. We will analyse its improvements over AFL in the next post.
WinAFL, on the other hand, sits somewhere between AFL and AFL++ on the feature timeline. It provides certain enhancements over AFL, especially in terms of blackbox fuzzing, but keeps AFL’s core design. Unlike AFL, which aims to be “simple, stupid”, I feel that WinAFL depends more heavily on the quality of the harness and its various customizable options. Of course it can still function if you just toss a random binary at it, but that won’t come close to the effectiveness and extensibility Project Zero designed it for. WinAFL’s documentation is however inferior, and you have to read the code to discover many of its subtle features.
It is worth noting that many, many more forks of AFL exist; exploring them is left to the reader.