Writing a static binary instrumentation engine
Introduction
As mentioned in previous posts, the theme of 2023 for this blog is fuzzing. We’ve looked at fuzzing papers, tools, source code, and applications. What better way to finish the year than to write our own fuzzer? If you’ve read Fuzzing 5, it should be clear that the core of a smart fuzzer is its instrumentation engine. This post will detail the steps and logic of writing your own SBI engine from first principles. The end product will be a fast and dirty tool that works with WinAFL out of the box, supports kernel and usermode fuzzing, supports x64, and has minimal overhead.
Why
I realised that none of the existing open-source fuzzers could fuzz the Windows kernel with ease.
Hypervisor-based fuzzers like wtf, or the underlying WHv APIs, don’t support device emulation, which means you have to write a bunch of hooks to implement something like special pool, which is absolutely vital during fuzzing. Fuzzers that do support full device emulation, such as kAFL, run on bare-metal Linux, are extremely tedious to get working for Windows, and are very slow compared to native execution. DBI is out of the question because trying to rewrite a loaded driver’s memory is going to cause a lot of problems (especially when other processes are interacting with it), affects performance severely (think of all the hooks if you’re doing a hook-based implementation), and provides no benefits compared to SBI. Usermode fuzzers probably use DBI only because robust frameworks like DynamoRIO already exist for other purposes and can be repurposed for fuzzing.
Hardware-based solutions like Intel PT don’t provide edge coverage out of the box and thus require additional parsing of the instruction trace. Furthermore, if the target performs DPCs or other callbacks that can run in arbitrary thread contexts, we can’t do process/thread-based filtering and will have to parse a huge instruction buffer generated by all processes.
This leaves us with SBI :)
Existing Solutions
Perhaps the best-known SBI implementation for the Windows kernel is pe-afl by Lucas Leong. Unfortunately this tool only supports 32-bit programs, and is pretty much unusable on modern Windows.
There exists a 64-bit port, peafl64, by Sentinel One. Maybe it works, maybe it doesn’t, but it didn’t work for me, I hate debugging other people’s code, and I suspect that it is overly complicated (sorry Sentinel One). For small projects I prefer writing the main logic from scratch so it’ll be easier to modify in the future. That being said, I did use bits and pieces of code from peafl64, especially the ida_dumper.py script and pefile.py, which are well made.
First Principles
Let’s go back to the dogma of AFL, which can be described in 3 lines of code.
cur_location = <COMPILE_TIME_RANDOM>;
shared_mem[cur_location ^ prev_location]++;
prev_location = cur_location >> 1;
The rationale behind this snippet has been explained in Fuzzing 2. All that’s important is that this snippet logs edge coverage to shared_mem, which AFL can consume to make mutation decisions. The original AFL modifies a compiler to inject this snippet at every branch target during compile time.
Since we don’t have source code, we’ll need to inject this snippet at every branch target after compilation is done. This can be further split into tasks such as finding all branch targets (basic blocks), expanding those basic blocks to make space for instrumentation shellcode, fixing offsets of instructions, and so on. This sounds tougher than it is, since we have IDA to provide us with the relevant offsets. But first let’s convert the above pseudocode into shellcode.
Shellcode
The original AFL allocates a 0x10000-sized buffer at the .cov section, and coverage is logged there.
However, WinAFL already pre-allocates a shared memory buffer for coverage, and the harness is supposed to map the buffer into its process, then copy the coverage bytes over.
shm_handle = CreateFileMapping(
As per the original pe-afl implementation:
memcpy(shared_mem, COV_ADDR, MAP_SIZE);
I thought this was rather redundant, and the copy contributes 4% of one cycle’s execution time when tested against a fast-processing binary.
Instead, my .cov section only contains a pointer to the shared memory section, which will be filled in by the harness. If the pointer is not present, the shellcode will terminate and not log coverage. Otherwise it will log coverage directly to shared memory.
# Shellcode that only updates coverage bitmap if the current pid matches the harness's pid
For kernelmode targets we’ll use a driver to map the coverage buffer into nonpaged memory before writing it to the .cov section. This is because the shellcode may be executed at IRQL >= 2, where we can’t afford a page fault.
With the shellcode ready, let’s see how we can extract crucial information from the binary using IDA.
ida_dumper.py
As mentioned above, we’re interested in information such as basic-block (BB) addresses, instructions with relative/rip-relative offsets, etc.
The original script by Sentinel One is quite clear, and I’ve only made slight changes to it. One of the changes is collecting the targets of jmp instructions as basic blocks.
# skip calls
The Sentinel One researchers feel that unconditional jumps should not be regarded as a change in basic block. However, I’ll need this information when expanding short jumps to near jumps, as explained in a later section. This does not affect coverage at all, so it’s fine.
In the end we’ll have the following information from the binary:
- basic block starts
- function starts and ends
- relative instructions
- rip-relative instructions
- exception handler addresses
This information is enough to instrument most normal binaries.
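For illustration, the basic-block collection piece can be sketched in a few lines of IDAPython (a simplified stand-in, not the actual ida_dumper.py):

import json
import idaapi
import idautils

# sketch: dump every basic block start RVA to JSON for the patcher to consume
image_base = idaapi.get_imagebase()
bb_starts = set()
for func_ea in idautils.Functions():
    for block in idaapi.FlowChart(idaapi.get_func(func_ea)):
        bb_starts.add(block.start_ea - image_base)   # store as RVAs

with open("bb.json", "w") as fp:
    json.dump(sorted(bb_starts), fp)

The real script additionally records function bounds, relative and rip-relative instructions, and exception handler addresses.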
Flow
orig_binary_path = sys.argv[1]
If we expand the basic blocks in place, we’ll have a lot more trouble to deal with, such as the relocation table and import table. An easier way is to leave the original sections alone and clone all executable sections to the end of the binary, after all the data sections. This way we can expand the cloned sections as much as we want, while all references to data sections remain valid.
Cloning Sections
for orig_sec in g_binary.sections.copy():
For all executable sections, we’ll make a new section at the end of the binary with the same properties, but a larger size. This is to make space for all the instrumentation shellcode about to be added.
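A rough sketch of what adding such a section involves with pefile (simplified, and assuming there is still room in the header for another IMAGE_SECTION_HEADER; the section’s raw bytes are appended to the end of the file when the patched binary is written out):

import pefile

def add_section_header(pe, name, virt_size, characteristics):
    # hedged sketch, not the tool's actual code
    file_align = pe.OPTIONAL_HEADER.FileAlignment
    sect_align = pe.OPTIONAL_HEADER.SectionAlignment
    last = pe.sections[-1]

    # new section starts at the next aligned RVA / raw offset after the last one
    new_rva = (last.VirtualAddress + last.Misc_VirtualSize + sect_align - 1) & ~(sect_align - 1)
    new_raw = (last.PointerToRawData + last.SizeOfRawData + file_align - 1) & ~(file_align - 1)
    raw_size = (virt_size + file_align - 1) & ~(file_align - 1)

    header  = name.ljust(8, b"\x00")                  # Name
    header += virt_size.to_bytes(4, "little")         # VirtualSize
    header += new_rva.to_bytes(4, "little")           # VirtualAddress
    header += raw_size.to_bytes(4, "little")          # SizeOfRawData
    header += new_raw.to_bytes(4, "little")           # PointerToRawData
    header += b"\x00" * 12                            # relocs/linenumbers (unused)
    header += characteristics.to_bytes(4, "little")   # Characteristics

    # write the header into the next free slot of the section table
    pe.set_bytes_at_offset(last.get_file_offset() + 0x28, header)
    pe.FILE_HEADER.NumberOfSections += 1
    pe.OPTIONAL_HEADER.SizeOfImage = (new_rva + virt_size + sect_align - 1) & ~(sect_align - 1)
    return new_rva, new_raw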
# Get some space for CFG table
We also make a .cfg section to house a new CFG table (because we will add more functions to it and the original one is too small), as well as a .cov section.
If the binary has too many executable sections, the original section table may not have room for the new entries, and they will overflow into the first section of the binary. We should be aware of this and save those bytes from the first section before cloning the sections:
# first save some bytes from the first section, in case section table overwrites into it
Injecting Instrumentation
The first challenge is determining the size of a BB, since IDA doesn’t tell us this directly. I take the size as the delta from one BB to the next, except for the last BB of the section.
if bb_idx + 1 == len(g_ida_data["bb"]):
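Conceptually, the same idea in isolation (names and helpers here are illustrative, not the tool’s actual structures):

# sketch: size of each BB = distance to the next BB start in the same section;
# the last BB of a section runs until the end of that section
bb_starts = sorted(g_ida_data["bb"])
bb_sizes = {}
for i, start in enumerate(bb_starts):
    nxt = bb_starts[i + 1] if i + 1 < len(bb_starts) else None
    if nxt is not None and same_section(start, nxt):   # hypothetical helper
        bb_sizes[start] = nxt - start
    else:
        bb_sizes[start] = section_end(start) - start   # hypothetical helper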
Then we figure out where to slot the instrumented BBs.
orig_section = GET_SEC_BY_ADDR(self.orig_rva)
For BBs in the same section we simply stack them contiguously in memory.
One problem is that valid Control Flow Guard (CFG) targets must be 16-byte aligned, otherwise the loader will reject them. We deal with this by padding every BB with nops after instrumentation.
self.data += b"\x90"*(0x10-len(self.data)%0x10) # 16 byte align
First up is replacing the magic bytes in the shellcode.
# replace magic in shellcode
The shellcode uses rip-relative instructions to retrieve the coverage pointers, which means we don’t have to deal with the .reloc table (used for hardcoded absolute addresses) and can simply replace the address part.
Then we enumerate all instructions residing in the current BB.
# get all relatives and rip_relatives in current BB
We can’t actually fix their destinations yet, because the sizes and offsets of the other BBs are still undetermined. Instead we just expand their sizes and pad the destinations with dummy values; this is only to calculate a final size for the BB after instrumentation.
Instructions that require expansion are those that take a one-byte offset as the destination (for example JE rel8). These instructions can only reference an offset of -128/+127, which may not be enough after injecting instrumentation.
A part of it:
if ins_operator > 0x6f and ins_operator < 0x80:
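For reference, the short-to-near rewrite itself boils down to swapping the opcode and widening the displacement field; roughly like this (my own simplified version, with the real destination patched in later):

# sketch: Jcc rel8 (0x70..0x7F) becomes 0F 8x rel32, JMP rel8 (0xEB) becomes E9 rel32.
# The 4-byte displacement is left as a placeholder until all BB offsets are final.
def expand_short_jump(ins_bytes):
    opcode = ins_bytes[0]
    if 0x70 <= opcode <= 0x7F:                        # conditional short jump
        return bytes([0x0F, 0x80 | (opcode & 0x0F)]) + b"\x00\x00\x00\x00"
    if opcode == 0xEB:                                # unconditional short jump
        return b"\xE9" + b"\x00\x00\x00\x00"
    return ins_bytes                                  # nothing to expand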
We collect the destinations of jmp instructions with ida_dumper.py so we can expand them as well.
The same is done for rip-relative instructions, but those are much easier since we just need to replace the destination bytes and don’t have to expand their sizes.
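A minimal sketch of that displacement fix-up (assuming disp_off marks where the 4-byte displacement sits inside the instruction):

import struct

# sketch: the data target stays put, only the instruction moved, so we just
# recompute disp32 = target - rip, where rip is the address after the instruction
def fix_rip_relative(ins_bytes, new_ins_rva, disp_off, target_rva):
    rip = new_ins_rva + len(ins_bytes)
    new_disp = struct.pack("<i", target_rva - rip)
    return ins_bytes[:disp_off] + new_disp + ins_bytes[disp_off + 4:]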
Since instructions are expanded in place, we’ll need to keep track of expanded sizes so we can correctly retrieve the next instruction.
ins_offset = increased_size + (ins.ins_loc - self.orig_rva)
The end product is a list of INS objects, each describing an instruction.
class INS:
Every BB will store such a list, and these instructions will be fixed after all BBs are expanded as above.
for bb in g_bb_list:
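The fix-up passes all lean on a single old-to-new address mapping; conceptually it is little more than a dictionary lookup (a sketch, not the exact implementation):

# sketch: g_old_to_new maps original BB start RVAs to the RVAs of their
# instrumented copies; anything not in the map (data, uninstrumented code)
# is left where it is
def bb_get_new_addr_from_old(old_rva):
    return g_old_to_new.get(old_rva, old_rva)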
Finally we write the BBs into their respective sections in the binary.
for bb_idx in range(len(g_bb_list)):
Let’s take cldflt.sys as an example.
Before instrumentation:
Instrumenting:
After instrumentation:
All code remains the same, just instrumentation added before every basic block.
Fix Entrypoint
This is the most important step.
If the entrypoint is not patched, it will call the uninstrumented main instead, and all function references (IRP handlers/callbacks) will point to the uninstrumented section.
Since we’ve collected the targets of jmp instructions, they will be patched as well, and we can safely update the entrypoint pointer in the PE’s optional header.
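With pefile this is essentially a one-liner (sketched here with the same old-to-new mapping described above):

# sketch: redirect the PE entrypoint to the instrumented copy of the entry BB
old_ep = g_binary.OPTIONAL_HEADER.AddressOfEntryPoint
g_binary.OPTIONAL_HEADER.AddressOfEntryPoint = bb_get_new_addr_from_old(old_ep)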
If you don’t treat jumps as a change of basic block, the mainCRTStartup function in CRT-linked executables will end up calling the uninstrumented __scrt_common_main_seh, which will reference the uninstrumented main function.
We can patch that by replacing all references to the main function with the instrumented main.
new_main_rva = bb_get_new_addr_from_old(orig_main_rva)
Fix Exception Handlers
Exceptions are quite pesky on Windows, especially when you have code written in C++ and C.
For starters we have __C_specific_handler for __try statements. Then we have __GSHandlerCheck for functions with stack canaries enabled. Then we have __GSHandlerCheck_SEH for __try statements within functions with stack canaries. Then there are __CxxFrameHandler3 and __CxxFrameHandler4 for C++ try statements. Finally there is __GSHandlerCheck_EH for C++ try statements within functions with stack canaries.
All of these have custom data formats that require custom patching to work; otherwise you’ll get false positives, such as a crash that should have been handled by an exception handler.
The good news is that kernelmode does not support C++ exception handling, and C++-style try will be compiled as __C_specific_handler. So we only really have 3 handlers to deal with: __C_specific_handler, __GSHandlerCheck and __GSHandlerCheck_SEH.
The exception table looks like this:
BeginAddress and EndAddress specify the start and end of a function that requires exception handling. UnwindInfoAddress points to an UNWIND_INFO struct.
typedef struct _UNWIND_INFO {
If the flags contain UNW_FLAG_EHANDLER or UNW_FLAG_UHANDLER, then the ExceptionHandler and ExceptionData members are present.
If the flags contain UNW_FLAG_CHAININFO, the BeginAddress, EndAddress and UnwindInfoAddress of a previous unwind info structure are stored after UnwindCode instead. The UnwindCode array is always a multiple of 2 entries long.
For __C_specific_handler, the ExceptionData contains a SCOPE_TABLE structure:
typedef struct _SCOPE_TABLE {
Its BeginAddress and EndAddress specify the part of code inside the __try block. HandlerAddress points to the function within the __except() parentheses, or is just 1 if EXCEPTION_EXECUTE_HANDLER is specified. It also points to the code inside the __finally block if that exists. JumpTarget points to the code inside the __except block if it exists, otherwise it’s 0.
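Patching therefore means walking every SCOPE_TABLE record and remapping its address fields, roughly like this (a sketch; raw is assumed to be a mutable bytearray of the section, and the special values 0 and 1 are left alone):

import struct

# sketch: remap BeginAddress/EndAddress/HandlerAddress/JumpTarget of each
# SCOPE_TABLE record to the instrumented section
def patch_scope_table(raw, off):
    count = struct.unpack_from("<I", raw, off)[0]
    pos = off + 4
    for _ in range(count):
        begin, end, handler, target = struct.unpack_from("<IIII", raw, pos)
        begin = bb_get_new_addr_from_old(begin)
        end = bb_get_new_addr_from_old(end)
        if handler > 1:                               # 1 == EXCEPTION_EXECUTE_HANDLER
            handler = bb_get_new_addr_from_old(handler)
        if target:
            target = bb_get_new_addr_from_old(target)
        struct.pack_into("<IIII", raw, pos, begin, end, handler, target)
        pos += 16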
__GSHandlerCheck contains some GS-specific data, which does not require patching.
__GSHandlerCheck_SEH contains a SCOPE_TABLE followed by GS-specific data, so we just need to patch the scope table. (Edit: looking back at it, we do need to patch the GS data if it’s together with SEH, otherwise a deterministic crash will happen in RtlUnwindEx. But I don’t care whether the stack canary is corrupted or the return pointer is corrupted; both will show up as a crash, and both are considered bugs to me. So I just forcefully replace all __GSHandlerCheck_SEH with __C_specific_handler.)
while start_raw < end_raw:
Exception-related data is stored in an ExceptionAddressUpdateInfo class:
class ExceptionAddressUpdateInfo:
This allows us to patch the exception addresses after all BBs are instrumented.
Fix CFG
CFG is rather easy to fix.
We just append the instrumented function addresses to the GuardCFFunctionTable.
The reason why we don’t update the table in place is to support partial instrumentation, where some functions may be left uninstrumented in the original section.
The CFG table has to be sorted in ascending order, so that will cause some issues.
Code to calculate the size of each CFG entry can be found on MSDN.
start = g_binary.DIRECTORY_ENTRY_LOAD_CONFIG.struct.GuardCFFunctionTable - g_binary.OPTIONAL_HEADER.ImageBase
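Per the load configuration documentation, each entry is a 4-byte RVA plus however many extra metadata bytes the top nibble of GuardFlags announces; a sketch of the stride calculation:

# stride = 4 + n, where n is encoded in the top 4 bits of GuardFlags
IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_MASK = 0xF0000000
IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_SHIFT = 28

guard_flags = g_binary.DIRECTORY_ENTRY_LOAD_CONFIG.struct.GuardFlags
stride = 4 + ((guard_flags & IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_MASK)
              >> IMAGE_GUARD_CF_FUNCTION_TABLE_SIZE_SHIFT)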
Helper Driver
For kernelmode fuzzing we’ll need a driver to update the .cov section.
I just wrote a simple one to perform read/write/map/unmap, nothing special.
case IOCTL_HELPER_READ_VM:
WinAFL Header
We can modify the syzygy header that comes with WinAFL.
On startup:
// Create handle to helper driver
On exit:
if (g_current_iterations == g_niterations) {
It’s also important to clear prev_loc after every iteration so coverage doesn’t get messed up.
Crash Detection
This is not an issue for kernelmode fuzzing, because the system bluescreens on a “crashable” bug and we can’t do much about it. However, since I want my fuzzer to handle usermode as well, this is an important topic to consider. It turns out that crash handling on Windows is inconsistent, and there’s no one-size-fits-all solution.
For starters, please read this and this to understand why we can’t easily catch all Windows crashes.
There are two possible methods (that don’t rely on external telemetry such as WER): use SetUnhandledExceptionFilter or AddVectoredExceptionHandler. The latter is used by the original syzygy, pe-afl, as well as peafl64.
Handlers registered with AddVectoredExceptionHandler can catch most of the interesting crashes on Windows. However, they run before frame-based handlers, so you get false positives where WinAFL reports a crash but the program actually handles the exception with SEH, for example. You may think this is not a big deal, but if the target constantly throws exceptions and handles them, the harness will report all of these as crashes to WinAFL, and by design WinAFL will terminate and restart the target process (which makes sense, because reported crashes are supposed to be unrecoverable). This renders persistent mode useless, significantly hindering efficiency.
To eliminate false positives, you can use SetUnhandledExceptionFilter to register your handler. This way the handler is called only when no other frame-based handler can handle the exception (i.e. an unrecoverable crash), so you can be sure it is not a false positive. Unfortunately, such a handler does not catch stack overflows or STATUS_HEAP_CORRUPTION exceptions, as mentioned in the links above. So we get false negatives instead.
False positives are undesirable, but false negatives are completely unacceptable. In the end I chose to stick with AddVectoredExceptionHandler, but I’ll make a point of checking whether the target throws exceptions frequently.
If that’s the case, I’ll manually exclude the crash site:
//
This is quite dumb, but I can’t just exclude all crashes if they occur more than a certain number of times, because invoking the default crash handler is more expensive than just terminating the process.
If the reader has a better solution and doesn’t mind sharing, I’ll be happy to discuss!
Test
Example vulnerable driver:
switch (irpSp->Parameters.DeviceIoControl.IoControlCode) {
Harness:
int main(int argc, char **argv)
For this small test driver we achieve 100% stability and ~40k execs/s, which is really nice. If we copy the coverage buffer manually every round like the original pe-afl, we only get ~30k. If we do file-based sample delivery instead of shared memory we get ~1k, because file operations are really expensive.
Crashing Sample Extraction
Now assume we’ve attached WinDbg to the fuzzing VM and caught a crash. How do we extract the crashing sample?
We could log the sample to disk before each run, but that requires opening the file with FILE_FLAG_NO_BUFFERING. Otherwise the sample will not be flushed to disk immediately, and when the kernel crashes it will be lost. Disabling buffering not only requires us to perform file operations in multiples of 512 bytes, but also results in a huge performance decrease. For the test driver above, logging the sample to disk reduces throughput to ~7k execs/s, which is too much of a trade-off in my opinion.
What if we extract the sample directly from the shared memory using WinDbg when the kernel crashes?
0: kd> !vad
We know that the shared memory is backed by the pagefile, has READWRITE permissions, and its mapped size is fully determined by the harness.
Judging from the output above, the shared memory containing the crashing sample is most likely page 0x2507b870, which corresponds to address 0x2507b870000.
0: kd> dc 0x2507b870000
Indeed, it contains the size and bytes of the crashing sample.
Here’s a Python script to automate this:
import sys
Where Fuzzer?
The fuzzer has not gone through extensive testing, and only works for well-behaved, Microsoft-released binaries at the moment. I’m certain there are edge cases to be supported, and I will release it after it reaches a decent level of robustness.
Conclusion
It’s been a fun year studying fuzzing and Windows internals.
I’m happy to make something that reinforces my understanding and is actually useful for my daily work!