Feature review of AFL and WinAFL
Introduction
Since we’ve already talked about the types of fuzzing quite some time ago (Fuzzing 1), let’s dive straight into an overview of American Fuzzy Lop (AFL), a coverage-guided evolutionary fuzzer.
In this post, we will go over the design inspiration behind AFL, its inner workings, as well as WinAFL, a fork by Project Zero for fuzzing Windows binaries.
Inspiration
Before coverage-guided evolutionary fuzzers were a thing in the late 2000s, researcher Tavis Ormandy (forgive me for making this sound like a history lesson… though it would be disrespectful to leave out the name ;)) used gcov to measure the coverage of various testcases, find the ones with the most coverage, and feed those as a corpus to a dumb fuzzer.
This netted him quite a number of bugs, and sparked the thought: “what if we performed this at regular intervals during the fuzzing session, so that the generated test cases always favoured coverage?”
However, it soon became clear that explicitly calculating coverage this way is resource intensive, making the approach look good on paper but infeasible in practice.
AFL’s main design principle then became “simplicity and usability”. For example, it simply favours testcases that do something new, instead of explicitly reasoning about coverage.
Quoting the README verbatim:

“In fact, at its core, it's designed to be just a very good traditional fuzzer with a wide range of interesting, well-researched strategies to go by. The fancy parts just help it focus the effort in places where it matters the most.”
AFL Details
Coverage
Just like in symbolic execution, coverage is one of, if not the most, important feedback metrics on the journey of program exploration.
AFL collects coverage using a 64 kB shared memory buffer. This size is chosen for efficient L2 cache access and a decently low collision rate for most programs.
The pseudocode, as given in the AFL whitepaper, is:

cur_location = <COMPILE_TIME_RANDOM>;
shared_mem[cur_location ^ prev_location]++;
prev_location = cur_location >> 1;

and is injected at every branch point in the compiled source code, made possible by AFL’s instrumenting wrappers around compilers like gcc.
Every byte set in the output map can be thought of as a hit for a particular (branch_src, branch_dst) tuple in the instrumented code.
Why is prev_location divided by 2 (shifted right by one)? This is to differentiate between the paths of tight loops. For example, a loop around branch A, producing the path A->A, would be recorded as shared_mem[A ^ A] = shared_mem[0], which is identical to that of path B->B if the division did not exist.
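The update can be sketched in C (map size and shift as described in the whitepaper; the function and variable names here are illustrative, not AFL’s actual symbols):

```c
#include <stdint.h>

#define MAP_SIZE (1 << 16)            /* the 64 kB coverage bitmap */

static uint8_t  shared_mem[MAP_SIZE];
static uint16_t prev_location;

/* Executed at every instrumented branch; cur is that block's
   compile-time random ID. */
static void afl_maybe_log(uint16_t cur) {
    shared_mem[cur ^ prev_location]++;
    prev_location = cur >> 1;         /* the "divide by 2" */
}
```

With the shift, a tight loop on block A updates cell A ^ (A >> 1) rather than A ^ A == 0, so the traces A->A and B->B no longer collide in cell 0.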
When a mutated input produces an execution trace containing new tuples, the corresponding input file is favoured, preserved and routed for additional processing.
If no new tuples are found, the input is discarded, even if the overall flow is unique.
For example, given that A->B->C and C->A->B are already favoured, A->B->E will be preserved (the tuple B->E is new), while A->B->C->A will be discarded, even though its overall flow is unique.
The fuzzer also considers coarse tuple hit counts, divided into several buckets:
1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+
Changes within the range of a single bucket are ignored (for example, a loop cycling 40 times vs cycling 43 times); a transition from one bucket to another is flagged as an interesting change in program control flow (a piece of code that ran once now runs twice, amazing!), and is favoured for evolution.
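AFL implements this classification with a 256-entry lookup table applied to the whole map; a hypothetical but behaviour-equivalent function version:

```c
#include <stdint.h>

/* Map a raw hit count to its coarse bucket, mirroring AFL's
   1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+ classes. Each bucket gets
   a distinct bit, so a bucket transition shows up as a new bit set. */
static uint8_t bucket(uint8_t hits) {
    if (hits == 0)   return 0;
    if (hits == 1)   return 1;
    if (hits == 2)   return 2;
    if (hits == 3)   return 4;
    if (hits <= 7)   return 8;
    if (hits <= 15)  return 16;
    if (hits <= 31)  return 32;
    if (hits <= 127) return 64;
    return 128;
}
```

A loop cycling 40 vs 43 times both land in the 32-127 bucket, so the difference is invisible; 1 vs 2 executions land in different buckets and get flagged.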
Evolution
These favoured test cases are added back into the input queue, and used as a corpus for next rounds of fuzzing.
AFL does not prioritise these newly added entries over the current queue; it simply appends them behind the existing ones. This design has empirically proven more effective.
If you imagine it in action, it’s similar to an ever-shrinking circle, instead of the directed “spear” poking everywhere that we would get if we prioritised the new mutations.
Among the queued test cases, AFL periodically re-evaluates the queue using a fast algorithm that selects a smaller subset of test cases which still covers every tuple seen so far. It works by looping through the tuples: for each tuple not yet covered by the subset, it finds the candidate testcases that cover it, selects the smallest (file size) and fastest (execution latency) one, then registers all tuples that this testcase exercises into the subset. It repeats this until every tuple has been considered.
The generated corpus of “favored” entries is usually 5-10x smaller than the starting data set. Non-favored entries are not discarded, but they are skipped with varying probabilities when encountered in the queue.
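The loop above is essentially a greedy set cover. A toy sketch (AFL’s real cull_queue() works on the shared bitmap and a top_rated[] array; the struct and score field here are hypothetical stand-ins):

```c
#include <stdint.h>
#include <string.h>

#define MAP_SIZE 64            /* shrunk map for the sketch */

/* Hypothetical per-testcase record: which tuples it exercises, plus a
   score combining file size and execution latency (smaller is better). */
struct tc { uint8_t tuples[MAP_SIZE]; unsigned score; int favored; };

/* Greedy cull: for every tuple not yet covered by the favored subset,
   pick the best-scoring case that hits it, then register every tuple
   that case covers. */
static void cull(struct tc *q, int n) {
    uint8_t covered[MAP_SIZE] = {0};
    for (int t = 0; t < MAP_SIZE; t++) {
        if (covered[t]) continue;
        int best = -1;
        for (int i = 0; i < n; i++)
            if (q[i].tuples[t] && (best < 0 || q[i].score < q[best].score))
                best = i;
        if (best < 0) continue;              /* no case hits this tuple */
        q[best].favored = 1;
        for (int t2 = 0; t2 < MAP_SIZE; t2++)
            if (q[best].tuples[t2]) covered[t2] = 1;
    }
}
```

Note that a large, slow testcase whose tuples are all covered by smaller cases ends up non-favored, which is exactly the 5-10x shrinkage described above.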
- If there are new, yet-to-be-fuzzed favorites present in the queue, 99% of non-favored entries will be skipped to get to the favored ones.
- If there are no new favorites:
  - If the current non-favored entry was fuzzed before, it will be skipped 95% of the time.
  - If it hasn't gone through any fuzzing rounds yet, the odds of skipping drop down to 75%.
A snippet:
if (pending_favored) {

  /* If we have any favored, non-fuzzed new arrivals in the queue,
     possibly skip to them at the expense of already-fuzzed or non-favored
     cases. */

  if ((queue_cur->was_fuzzed || !queue_cur->favored) &&
      UR(100) < SKIP_TO_NEW_PROB) return 1;

}
This adds some brains to our shrinking circle, removing redundancy and prioritising certain parts for efficiency. Since no testcase is actually discarded, accuracy is not sacrificed in the long run.
It also employs a minimization algorithm to actively try and reduce the size of the input testcase.
The actual minimization algorithm is:

1) Attempt to zero large blocks of data with large stepovers,
2) Perform a block deletion pass with decreasing block sizes and stepovers, binary-search-style,
3) Attempt alphabet normalization by counting unique characters and trying to bulk-replace each of them with a zero value,
4) As a last step, perform byte-by-byte normalization on non-zero bytes.
A snippet of it:
/* Select initial chunk len, starting with large steps. */

len_p2 = next_p2(q->len);

remove_len = MAX(len_p2 / TRIM_START_STEPS, TRIM_MIN_BYTES);
This algorithm is by no means elegant on paper, but it’s definitely effective and has low overhead. For example, a buffer-overflow testcase that was initially abcdefghijklmnopqrs can be minimized to 0000000000000000000, clearly revealing that the specific characters are not required to trigger the path.
Mutation
The original AFL engine consists of 2 stages of mutation.
The first stage is highly deterministic, consisting of sequential bit flips, byte flips, addition/subtraction of small integers, and insertion of interesting values (such as -1, max integer, etc.). When changes to a certain region of the testcase do not trigger any new tuples, that region may be excluded from the following deterministic runs for efficiency.
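For illustration, the walking-bitflip primitive and a few of the “interesting” constants look roughly like this (afl-fuzz.c uses an equivalent FLIP_BIT macro; the value list here is abbreviated):

```c
#include <stdint.h>
#include <stddef.h>

/* Flip a single bit of the testcase buffer. Calling it twice with the
   same position restores the original byte, which is how the walking
   bitflip stage undoes each mutation after running the target. */
static void flip_bit(uint8_t *buf, size_t pos) {
    buf[pos >> 3] ^= 128 >> (pos & 7);
}

/* A few "interesting" 32-bit values tried at each position
   (abbreviated; AFL's full list also includes powers-of-two edges). */
static const int32_t interesting_32[] = {
    -1, INT32_MIN, INT32_MAX, 0, 1, 65535, 65536
};
```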
/* We also use this stage to pull off a simple trick: we identify
   bytes that seem to have no effect on the current execution path
   even when fully flipped - and we skip them during more expensive
   deterministic stages, such as arithmetics or known ints. */
The reason for this stage is to keep the difference between the initial corpus and the crashing input small, so it’s easier for the user to reason about the final crashing testcase.
The second stage consists of stacked, random applications of the operations mentioned above, as well as splicing of two testcases.
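Splicing can be sketched as a simple crossover (a toy version; AFL picks the split point at random within a region where the two inputs differ, and the function name here is illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Combine the head of testcase a with the tail of testcase b at a
   crossover point, producing a new candidate input in out.
   Returns the output length, or 0 if the split point is invalid. */
static size_t splice_cases(const uint8_t *a, size_t a_len,
                           const uint8_t *b, size_t b_len,
                           size_t split, uint8_t *out) {
    if (split > a_len || split > b_len) return 0;
    memcpy(out, a, split);                      /* head from a */
    memcpy(out + split, b + split, b_len - split); /* tail from b */
    return b_len;
}
```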
Dictionary
AFL supports the use of dictionaries to aid in fuzzing.
An example PNG dictionary looks like:
header_png="\x89PNG\x0d\x0a\x1a\x0a"
These specified keywords will be used by AFL when mutating testcases, in order to formulate correct header bytes efficiently and pass format checks.
if (!extras_cnt) goto skip_user_extras;
This is quite necessary when dealing with parsers that enforce strict checksums, unless you want AFL to spend weeks discovering the correct format. As we will see in the next post, AFL++ improves on this with REDQUEEN, which efficiently overcomes checksums and magic bytes without resorting to symbolic execution.
Apart from allowing the user to supply a dictionary of keywords for the fuzzer to use, AFL attempts to build a dictionary automatically, by collecting parts of testcases that behave like atomic “keywords” during mutation.
This blogpost by the creator of AFL explains it very well, so I’m just going to quote him.
https://lcamtuf.blogspot.com/2015/01/afl-fuzz-making-up-grammar-with.html
To explain the approach, it's useful to rely on the example of a PNG file. The PNG format uses four-byte, human-readable magic values to indicate the beginning of a section, say: …
Duplication Removal
AFL considers a crash unique if one of the following two conditions is met:

- The crash trace includes a tuple not seen in any of the previous crashes,
- The crash trace is missing a tuple that was always present in earlier faults.
A -C flag passed to AFL enables crash exploration mode. AFL mutates these crashes and attempts to find results that are clearly controllable from the input. For example, a segfault at an address corresponding to the testcase’s data probably signifies an arbitrary jump, which is surely something to note.
if (fault == crash_mode) {
This however adds potentially unnecessary workload to the queue, so I don’t recommend using it. Exploration can be done separately after the crash cases are reported.
Forkserver and Persistent Mode
AFL supports “forkserver mode” to improve efficiency.
Essentially, AFL does nothing until the loader has done all its work and the program is at its entry point. Then it forks the current image and runs each fuzz iteration in a forked process, so a single round of loading suffices for the entire fuzzing session, skipping a lot of libc linking overhead.
It works by injecting a shim into the program being fuzzed, right before the program’s entry point; the shim opens a pipe to communicate with AFL. Once a fork command is received, it forks and resumes execution of the binary in the child process, while the parent monitors it.
__afl_fork_wait_loop:
You may ask: how are the fuzz payloads delivered? Via a file (since forked processes share file descriptors). Each newly forked process rewinds the fd, so it reads the freshly delivered payload.
/* Isolate the process and configure standard descriptors. If out_file is
   specified, stdin is /dev/null; otherwise, out_fd is cloned instead. */
With fast targets, the fork server can offer considerable performance gains, usually between 1.5x and 2x.
Persistent mode is just forkserver mode, except that each forked process is fed multiple testcases before it is terminated.
Binary Only Mode
In cases where source code is not available, AFL provides a binary only mode.
qemu_mode ? "qemu " : "", dumb_mode ? " dumb " : "",
which… quite obviously has significant overheads.
Instead of instrumenting native code, AFL uses QEMU’s “user emulation” mode to run the binary, and instruments the code at QEMU’s intermediate representation level.
The QEMU process is initialized once and then forked, to avoid the (quite massive) startup overhead. Another cost is QEMU’s translation of new, never-before-seen blocks; AFL adds the translated code to a cache shared with future QEMU clones to avoid redundant translation.
As a result of these two optimizations, the overhead of the QEMU mode is roughly 2-5x.
WinAFL
WinAFL is a fork of AFL by Project Zero. The original AFL project was quite *nix-centric and can’t be used to test Windows binaries; WinAFL modifies and extends AFL to work on Windows.
Persistent Mode
WinAFL does not support forkserver mode, since Windows does not natively provide a fork() equivalent.
Instead, it relies heavily on persistent mode for speed. Heavy initialization code should be performed in the main function of the harness, while a separate fuzz function is created to invoke the target. This fuzz function’s offset is fed to WinAFL, which re-executes the function a user-specified number of times.
WinAFL is designed to run many different testcases (from tens up to 5000, user-specified) in a single process in order to match the efficiency of AFL’s forkserver mode.
That also means the harness needs to be clean and accurate: not too much initialization and cleanup inside the fuzz function, yet the target’s state must stay generally clean across iterations so it does not interfere with the next payload.
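A hypothetical harness following that shape might look like this (parser_global_init, fuzz_one_input, and the parsing call are stand-ins for your target; in practice the function’s address is handed to WinAFL via its -target_offset or -target_method options):

```c
#include <stdio.h>
#include <stddef.h>

static int initialized;

/* Heavy, one-time setup: runs once in main, outside the fuzzed loop. */
static void parser_global_init(void) { initialized = 1; }

/* The function WinAFL re-executes per testcase. It must fully consume
   the input each time and leave no lingering state behind. */
int fuzz_one_input(const char *path) {
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    char buf[4096];
    size_t n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    /* target_parse(buf, n);  -- call the code under test here */
    return (int)n;
}
```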
Shared Memory Payload Delivery
WinAFL adds the ability to deliver the fuzzing payload to the target program through shared memory, instead of through a file as we’ve discussed before with AFL. The layout is simple: the first 4 bytes are the sample size, and the following bytes are the payload data.
The shared memory’s name is passed to our target program to replace the initial file name arguments.
if (!use_sample_shared_memory) {
This requires us to write a harness that explicitly reads the data from shared memory. The name of the shared memory mapping is passed in via the AFL command line, replacing the filename argument @@.
Shared memory is faster and avoids some problems with files (e.g. being unable to overwrite the sample file because the target maintains a lock on it). It also reduces stress on the disk.
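A portable sketch of parsing that size-prefixed layout on the harness side (the helper name is hypothetical; on Windows the buffer itself would come from OpenFileMapping/MapViewOfFile on the name WinAFL passes in):

```c
#include <stdint.h>
#include <string.h>

/* Parse WinAFL's shared-memory sample layout: a 4-byte length followed
   by the payload bytes. Returns the payload length, or 0 on a malformed
   buffer. `max` is the size of the mapping.
   Note: memcpy of the raw prefix assumes a little-endian host. */
static uint32_t read_sample(const uint8_t *shm, uint32_t max,
                            const uint8_t **payload) {
    if (max < 4) return 0;
    uint32_t len;
    memcpy(&len, shm, 4);
    if (len > max - 4) return 0;      /* length must fit in the mapping */
    *payload = shm + 4;
    return len;
}
```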
DynamoRIO Blackbox
WinAFL uses DynamoRIO to perform dynamic instrumentation (similar to AFL’s QEMU mode).
DynamoRIO takes care of the persistent-mode mechanics mentioned above, as well as logging coverage.
This approach has been found to introduce an overhead of about 2x compared to native execution speed, thus comparable to the original AFL in binary instrumentation mode.
Another DynamoRIO feature is fuzzing an already running process by attaching to it, although I don’t see much point in that.
Syzygy Instrumentation
While AFL supports instrumenting source code via its custom gcc wrapper, WinAFL doesn’t really support fuzzing straight from source code.
It uses syzygy along with full PDB symbols to instrument an already-compiled binary using a technique called binary rewriting: the compiled binary is taken apart, analysed, “transformed” by inserting coverage code snippets, and finally pieced back together.
Syzygy is no longer supported and only works on 32-bit binaries, so it’s quite useless, to be frank.
However, it does allow for a 2-4x speed increase compared to DynamoRIO.
Intel PT Blackbox
Instead of instrumenting the code to collect coverage, WinAFL’s Intel PT mode uses the CPU’s Intel Processor Trace capability to retrieve a trace of the executed instructions.
The trace is compressed, written to a buffer of user-specified size, and decoded by WinAFL to obtain coverage.
This is usually slower (https://github.com/googleprojectzero/winafl/issues/231) than DynamoRIO, depending on the size of the trace buffer specified, and requires at least Windows 10 v1809 and a relatively recent Intel CPU. It also does not work against self-modifying code.
One perk is that no modification/instrumentation of the code is required, so it might work in cases where neither of the other options does.
Custom Functionalities
WinAFL is quite customizable (much more so than AFL) and allows users to write their own processing DLLs.
Examples include custom mutators, network fuzzing capabilities and more.
The snippet below is part of a “server” DLL used to fuzz network clients.
CUSTOM_SERVER_API int APIENTRY dll_run(char *data, long size, int fuzz_iterations) {
By providing custom implementations of the dll_init and dll_run functions, WinAFL can be transformed into other types of fuzzers.
Conclusion
AFL gives off the vibes of a true hacker tool: a bunch of simple yet well-engineered hacks stuck together, and extremely practical. Every idea and feature is implemented so that it nudges toward the final objective without forcing it (unlike certain academic tools that pride themselves on high accuracy but take forever on production software).
I believe AFL dominated the security space during its actively maintained years. Although probably still very useful now, targets are getting more hardened and new practical research pops up daily. A community-maintained fork of AFL, named AFL++, emerged around 2020. We will analyse its improvements over AFL in the next post.
WinAFL, on the other hand, sits somewhere between AFL and AFL++ on the feature timeline. It provides certain enhancements over AFL, especially in terms of blackbox fuzzing, but keeps AFL’s core design. Unlike AFL, which aims to be “simple, stupid”, I feel that WinAFL depends more heavily on the quality of the harness and its various customizable options. Of course it can still function if you just toss a random binary at it, but that won’t come close to the effectiveness and extensibility Project Zero designed it for. WinAFL’s documentation is however inferior, and you have to read the code to discover many of its subtle features.
It is worth noting that many, many more forks of AFL exist; exploring them is left to the reader.