Fuzzing Seminar Notes
Notes of papers I surveyed to prepare my seminar homework paper.
Böhme - Fuzzing: Challenges and Reflections
- mutational: seed + mutation
- generational: generation from grammar
box taxonomy
- black:
  input generation with no knowledge of the program.
  Peach
- grey:
  a bit of instrumentation at compile time.
  AFL, LibFuzzer, HonggFuzz
- white:
  mutates path conditions, not inputs.
  KLEE, SAGE
Enablers of recent Fuzzing surge?
- need: software eating the world
- incentives: bug bounty etc.
- tools: effective, easy to use, OSS fuzzers
Challenges
- More Software: How to fuzz cyber-physical systems, ML systems, stateful
software, polyglot software, GUI-based programs?
- More Bug types: How to identify information leaks, side-channels,
privilege escalation, RCE?
- More Difficult / "deep" Bugs
- More Empirical Studies: What's nature and distribution of vulns in code?
- Human in the Loop
- Usability
- How to assess residual risk after fuzzing?
- theoretical limits of fuzzing?
- How to evaluate special (GUI, ...) fuzzers? overfitting? time budget?
How to evaluate techniques, not implementations?
- Are synthetic / previously discovered bugs representative?
Li - Fuzzing: A Survey
:(
Godefroid - Fuzzing: Hack, Art, and Science
approx three main ways to detect vulns
- static program analysis: fast, shallow, false positives
- manual code inspection: labor-intensive, does not scale
- fuzzing
Whitebox: dynamic SymEx / Concolic Execution
SAGE
Miller - The Relevance of Classic Fuzz Testing: Have We Solved This One?
Rehash of the initial 1990 fuzz approach. Still finds bugs.
Liang - Fuzzing: State of the Art
Fortify binary to hinder fuzzers
Evaluated with libjpeg, pcre2 etc.
- Coverage down ~70% on avg., though sometimes great, sometimes bad
- Found Bugs down between 60% (LAVA-M dataset) to 90% (binutils v2.3)
- binutils overhead File size: ~60%, CPU: ~4%. GUI app: 5%, 1%
How does this work with actually released software (i.e. v1.1, v1.2, ...)?
Is each version fuzzified similarly, or does that leak info?
Valgrind wasn't adopted by Google devs due to 20x slowdown.
Sanitizers each have only ~2x slowdown, thus every Chromium commit can be run against all unit tests with sanitizers enabled.
libFuzzer
very easy to use, finds heartbleed in a few seconds.
Hardening: Check values of fn ptrs so only expected fns are hit; keep a shadow
stack (stored somewhere else) so that the return address can't be overwritten.
Previous Work
- Mutation-based: Combine AST subtrees from seeds. LangFuzz, IFuzzer, GramFuzz
- Generation-based: Apply JS grammar rules. jsfunfuzz
Contribution
- Dataset: JS engine regression tests, PoCs of CVEs. ~30K JS files
- Trains LSTM Model to generate new code
- Fuzzes ChakraCore (IE)
- Finds (a bit) more Bugs than CodeAlchemist, jsfunfuzz in 72h, found some new bugs.
- Static Analysis: identify schedule-relevant code parts, probabilistically
instrument, increase schedule diversity
- Seed selection that prioritizes new "regular" schedules and new interleavings (?)
- Run with TSan
- Background: SmartTVs run heavily customized AOSP
- Infer input validation from log msgs from Android ROM
- mutate according to logs
[Overview](ampfuss overview.png)
- Looks for UDP requests that excite large responses
- Server fuzzing difficult/different:
- Timing sensitive socket input: too early (server still starting), too
late (time wasted). Fix by notifying fuzzer.
- non-terminating handle-loop: Transform into one-shot program by
  identifying sources, sinks
- Difficult to fuzz over-the-air: Closed-source impl, timing constraints.
Current Approach either manual input generation or too many invalid inputs
generated by mutation.
- Approaches for BT fuzzing
- Use Bluetooth Dongle: cannot fuzz low-lvl stuff
- Emulation: well, you need emulator
- Their approach: Use special Bluetooth hardware (to send more or less
  arbitrary data?), be very smart about state.
- Coverage-guided fuzzing timeline
- Bunny-the-fuzzer (2007)
- SAGE 2008
- AFL
- Heartbleed
- 2011-12-31: Introduced into OpenSSL
- 2014-03: Found by audit and, independently, specialized fuzzer
- 2015-04-07: Hanno Böck: AFL finds in 6 hours
- 2015-04-09: Serebryany: libFuzzer in 10 seconds
- Problem: Fuzzing done by security researchers, not by code owners
- libFuzzer features
- fuzzing corpus: previously recorded relevant inputs
- real inputs for mutations
- fuzzing dict: important strings
- timeouts are problematic when fuzzing: can't reach further
- Continuous Fuzzing / Fuzz-Driven Development
- APIs as Fuzz Target
- Tests as Seed Corpus
- CI includes Continuous Fuzzing
- Summary
- Coverage-guided fuzzing is easy
- Fuzzing must be continuous, automated, maintained by code owners
- Fuzzing vs Concolic Execution
- Fuzzing: Good at general inputs. Problem: specific inputs
- ConcExe: Good at specific inputs. Problem: state explosion
- Idea: Hybrid fuzzing: Only fork at hard-to-fuzz branches.
E.g. Driller.
Hybrid fuzzing for "real world software" (Driller rather CGC specific?)
Note: Requires Developer hints
- Traditional obfuscation useless against fuzzing.
- "Dragnet-style fuzzing": automatically fuzz the top 500 AUR packages
(cf Fuzzing Ex Machina)
- Fuzzing History:
- First Fuzzing Paper (?) by Duran and Ntafos
- Blind / Black-box Fuzzers
- mutational: ZZUF, RADAMSA
- AFL: efficient coverage info
- Coverage-Info Handling:
- AFL: Uses a bitmap to approximate BB edges.
  Angora: additionally XORs the call stack into the edge (context-sensitive).
- Vuzzer: Exact BB coverage, BB weighting depending on how "deep" in
control flow. Weights -> Fitness functions -> evolutionary algo for
mutation fuzzing.
- Hybrid Fuzzing: Concolic Exe to reach improbable paths.
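The AFL bitmap scheme above can be sketched like this (a minimal Python model of what AFL does in C; the block IDs and map size here are illustrative, the XOR-with-shifted-predecessor trick is AFL's):

```python
# Minimal model of AFL-style edge coverage. AFL instruments each basic block
# with a random ID and records the edge A->B roughly as
# shared_mem[cur ^ (prev >> 1)]++, trading exactness for a cheap fixed-size map.

MAP_SIZE = 1 << 16  # AFL's default bitmap size

def new_trace():
    return bytearray(MAP_SIZE)

def record_edge(bitmap, prev_id, cur_id):
    idx = (cur_id ^ (prev_id >> 1)) % MAP_SIZE
    bitmap[idx] = (bitmap[idx] + 1) & 0xFF  # 8-bit hit counter, wraps like AFL's
    return cur_id  # caller keeps this as prev_id for the next block

# A path A->B->C plus the edge A->C hits three distinct slots (no collision here):
bm = new_trace()
A, B, C = 0x1234, 0xBEEF, 0x0042
prev = record_edge(bm, A, B)
prev = record_edge(bm, prev, C)
record_edge(bm, A, C)
print(sum(1 for x in bm if x))  # 3
```

The shift is what makes A->B distinguishable from B->A; hash collisions are accepted as the price of a fixed-size map.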
- Central Fuzzing Assumptions
- Coverage relevant
- Crashes can be detected
- Many executions per second
- Constraints are solvable
- Attacks
- coverage -> irrelevant ("fake") code
- crash detection -> custom signal handler with exit 0
- many executions -> slowdown for weird input
- constraints solvable -> hash comparisons
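The "constraints solvable" attack is easy to demonstrate (my own toy example, not from the paper): routing a magic-byte compare through a hash removes both coverage feedback on partial matches and any hope of solver inversion.

```python
# Toy demonstration: a plain magic compare can leak progress via coverage
# (byte-wise compares flip branches one byte at a time in C code), while
# hashing first yields a single opaque compare that neither mutation feedback
# nor a symbolic solver can work through.

import hashlib

MAGIC = b"FUZZ"
MAGIC_HASH = hashlib.sha256(MAGIC).digest()

def check_plain(data: bytes) -> bool:
    return data[:4] == MAGIC          # fuzzer-friendly

def check_hardened(data: bytes) -> bool:
    return hashlib.sha256(data[:4]).digest() == MAGIC_HASH  # solver-hostile

print(check_plain(b"FUZZing"), check_hardened(b"FUZZing"))  # True True
print(check_hardened(b"FUZA"))                              # False
```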
- uses existing cov-guided fuzzer (AFL, honggfuzz) to generate inputs
- When fuzzer gets "stuck": finds and removes non-critical sanity checks
- crash analyzer: Check that input also crashes real program by collecting
path constraints and ensure that they are satisfiable.
Filters out false positives with 6~30% false negatives
- Based on AFL.
- Aims to balance Exploration/Exploitation of seed discovery.
- Improved Search/Scheduling Algorithm.
- Problems with AFL, libFuzzer: low coverage, manual, not scalable
- Use lib and its consumers to infer API dependence graph (which functions to
call in which order), then generate libFuzzer stub.
- Eval against manually written fuzzers:
- a bit more code cov (~55% for FuzzGen vs ~48% for manual)
- fewer bugs found (17 vs 29): "Manual fuzzers test more thoroughly 'buggy' parts"
On avg. 5x speedup
Predicts reachability of program inputs without execution.
- Problem: sanity-checking branches
- Solution: hybrid fuzzers, which have their own problems: constraint solving,
  path explosion, environment modeling.
- RQ1: lightweight and accurate taint analysis?
- RQ2: How to efficiently guide mutation with taint?
- RQ3: How to tune the fuzzer's evolution direction with data-flow features?
Based on Angora.
target branches that are instrumented by sanitizers
Maybe prune some ASan instrumentations by preferring profiling-wise "cold" or
complex code.
Use Data Flow Analysis to solve path conditions.
Fuzzed Contiki-NG (low-resource IoT OS) using 8 fuzzers over three years.
AFL: impressively easy to use; 55 "unique" crashes caused by one (shallow) bug
in one hour.
Suggestion: early fixing.
- RQ1: hybrid fuzzers > mutation-based fuzzers?
  - "Did not detect any clear superiority of hybrid fuzzers wrt ability to
    expose bugs compared to mutation-based fuzzers"
- RQ2: employed techniques to expose bugs fast?
  - MOpt, SymCC, QSYM stand out in terms of finding bugs fast
- RQ3: Are there fuzzers that expose bugs every time?
  - MOpt, SymCC, QSYM stand out in terms of finding bugs consistently
- RQ4: Do sanitizers pay off for their runtime overhead in a time-limited run?
  - ASan pays off for more difficult bugs
  - EffectiveSan generally pays off
- Challenge: Automating vuln discovery (Main)
- fuzz more systems?
- detect more vulns?
- find deep bugs?
- Empirical nature of undiscovered vulnerabilities?
- Challenge: Evaluation and Benchmarking / fair fuzzer benchmark
- How to evaluate specialised fuzzers?
- How to prevent overfitting? (Goodhart's Law)
- Continuous Eval à la FuzzBench
- What is a good performance measure?
- "Time to retire Lava & CGC, they are actively harmful"
- Eval techniques, not implementations?
- "LibFuzzer, AFL++, HonggFuzz went through major performance
improvements -- enabled by FuzzBench"
Case instead of time as x-axis.
For example, if I wanted to prototype a new mutation strategy for AFL, I
would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc.
I effectively have to make sure my mutator is at-or-better than existing AFL
mutator performance to use benchmarks like this.
- 10x faster than state-of-the-art
- found bugs in Perl, ErlangVM
- AFL's input generation+prune algorithm is a Markov chain
  - path: file that is generated from another file and retained by AFL since
    it discovered a new path
  - energy: how often the file is mutated and executed
  - probability p_ij: prob. to find interesting file j by mutating i
- AFL gives every input from the queue the same energy, i.e. fuzzes that
  input 80k times.
  - might be too much/too little energy to find new paths.
  AFLFast instead exponentially increases energy.
  - focuses execution on hot paths: half of the inputs in a campaign of nm
    exercise three paths.
  AFLFast assigns energy inversely proportional to "hotness".
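A minimal sketch of an AFLFast-style exponential power schedule (my notation, not the paper's; the base and cap values are invented):

```python
# Toy AFLFast-style power schedule. Energy doubles each time a seed is
# re-chosen and is divided by how often the seed's path was exercised,
# so seeds on cold paths accumulate fuzzing effort; a cap bounds the budget.

def energy(base, times_chosen, path_frequency, cap=16000):
    return min(base * 2 ** times_chosen // path_frequency, cap)

# Cold-path seed (path seen once) vs hot-path seed (path seen 1000 times):
print(energy(100, 4, 1))     # 1600
print(energy(100, 4, 1000))  # 1
```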
- Search strategy:
  - AFL: FIFO
  - AFLFast: "most progressive first": priority on seeds that either exercise
    "colder" paths or have been chosen less often. Also don't over-prioritize
    a seed.
- Eval (24-hour runs):
  - finds 6 CVEs in nm seven times faster.
  - finds an order of magnitude more crashing inputs in cxxfilt, nm, objdump
Discussion of Böhme and Zalewski
weaker precursor: Dowser
builds on top of AFLFast
- Use cases for directed fuzzing: patch testing, crash reproduction, etc.
- SymEx can easily be directed (constraint satisfaction problem).
- To direct AFL, it has to be viewed as optimisation problem.
- At instrumentation time:
- Extract call graph, control flow graph
- For each basic block (BB), compute (approx) distance to target
- Instrument program to aggregate distance values
- At runtime:
- Collect coverage, distance info
- Fuzz an input proportional to its distance to target
- Implementation: two uint64 values in AFL's shared_mem map for each BB:
  - Aggregated BB-level distance value
  - Number of executed BBs
- Direction by influencing a seed's energy (cf. AFLFast):
  More energy (no. of mutations+execs) to seeds that are closer to the target.
  Not gradient descent (problem: local minima) but simulated annealing
  (first exploration, then exploitation).
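The annealing-based schedule can be sketched as follows (heavily simplified: AFLGo cools exponentially, I use linear cooling, and the weight formula is my own interpolation; norm_distance in [0,1] is the seed's normalized distance to the target):

```python
# Simplified AFLGo-style annealing power schedule (assumptions noted above).

def annealed_weight(norm_distance, t, t_exploit):
    temp = max(0.0, 1.0 - t / t_exploit)  # 1.0 at start, 0.0 from t_exploit on
    # high temp: distance ignored (exploration); low temp: distance dominates
    return (1.0 - norm_distance) * (1.0 - temp) + 0.5 * temp

print(annealed_weight(0.1, 0, 60))             # 0.5: start, distance irrelevant
print(round(annealed_weight(0.1, 60, 60), 2))  # 0.9: close seed favored
print(round(annealed_weight(0.9, 60, 60), 2))  # 0.1: far seed starved
```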
- Eval:
- redo KATCH (KLEE based patch testing) benchmark
- 175 diffutils patches, 181 binutils patches
- reaches more destinations than KATCH(!) (also much faster obv.)
- redo BugRedux (KLEE based crash repro) benchmark
- 7 crashes in
sed
, grep
, gzip
, ...
- Able to reproduce 6 of 7 crashes, BugRedux only did 2 out of 7.
Hypervisor + Intel CPU Extensions for near-native performance for kernel
fuzzing
Intel Processor Trace: traces taken/not-taken branch decisions and target addresses of jumps.
Fuzzes kernel in virtual machine.
Fuzzer communicates with agent that sits below/beneath (?) kernel in vm.
Fuzzer receives coverage info from Intel PT Driver.
Uni Bochum
builds on top of kAFL
- Roadblocks:
- Magic Numbers: if input == $magic then 1 / 0 else 1;
  the crash is rarely reached since guessing $magic is unlikely
- nested checksums
- Idea: Input-to-State Correspondence
- determine cmp instructions, place hooks there
- When hit, record actual and expected value, e.g. act: 0xcafe, exp: 0x1234.
  replace(0xcafe, 0x1234) in the next fuzz input
- Binary instrumentation, thus 10x slowdown, thus only done for new queue
entries
- Problem: inputs with e.g. many 0s, and replace(0, ...).
  Solution: find a more random input that executes the same trace.
- Patch out checksum stuff etc (as in T-Fuzz).
To weed out false positives, don't use SymEx after the fuzz campaign
(since queue still full of false positives) but input-state-corresp.
during the campaign.
- Handle nested checksums by topo sort and then fixing in the right order.
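The core replace step is tiny (a sketch; real REDQUEEN also tries different encodings, endianness, and arithmetic offsets of the observed value):

```python
# Sketch of the input-to-state replace step: substitute the expected cmp
# operand wherever the observed operand occurs verbatim in the fuzz input.

def i2s_replace(fuzz_input: bytes, act: bytes, exp: bytes):
    """Yield one variant per occurrence of the observed operand `act`,
    with the expected operand `exp` substituted at that position."""
    start = 0
    while True:
        pos = fuzz_input.find(act, start)
        if pos < 0:
            return
        yield fuzz_input[:pos] + exp + fuzz_input[pos + len(act):]
        start = pos + 1

# cmp observed act=0xcafe, exp=0x1234 in a seed containing 0xcafe once:
for v in i2s_replace(b"AA\xca\xfeBB", b"\xca\xfe", b"\x12\x34"):
    print(v)  # b'AA\x124BB'
```

With an input full of 0s and act = 0, nearly every position matches, which is exactly the degenerate case noted above that the more-random-input trick fixes.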
- Eval:
- All binutils that read (not write) a single input file
- outperforms kAFL, Laf-Intel by 2x
- Slower than the others since kAFL reaches longer paths
- Speaker's thoughts on eval:
- LAVA-M is completely solved
- Next-gen dataset: "large number of real world programs with coverage"
since afaiu everything that's hard for a fuzzer should reflect in
coverage (e.g. hard to reach constraints)
Driller: Augmenting Fuzzing Through Selective Symbolic Execution
Combines AFL and angr (concolic execution for binary;
paper,
website).
core intuition: input can be divided into general and specific parts, with
many resp. few accepted values. Checks for specific input divide the app into
compartments.
- Fuzzes with AFL until it deems itself stuck.
- Invokes concolic execution engine, passing all interesting inputs from
fuzzing.
- Concolic execution determines inputs that reach new paths/compartments.
"smart" without symex
- AFL struggles with magic bytes, e.g. if buf[5] == 0xD8 && buf[6] == 0xFF
- does magic byte detection by extracting cmp immediate operands
- does dynamic taint analysis to correspond input bytes to cmp operands
  on first execution of a seed
- Also (de)prioritizes basic blocks
(+ if deeply nested, - if determined to be error handling)
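The resulting mutation step can be sketched like this (names and structure mine; the cmp_sites list stands in for what the tool derives from extracted immediates plus the taint pass):

```python
# Toy version of magic-byte patching: write each cmp immediate to the input
# offset that taint analysis says flows into that cmp, instead of hoping
# random mutation guesses e.g. buf[5] == 0xD8 && buf[6] == 0xFF.

def patch_magic(seed: bytearray, cmp_sites):
    """cmp_sites: list of (input_offset, immediate_bytes) pairs."""
    out = bytearray(seed)
    for off, imm in cmp_sites:
        out[off:off + len(imm)] = imm  # place the immediate where taint says it flows
    return bytes(out)

# Satisfy the two byte checks directly:
seed = bytearray(b"\x00" * 8)
print(patch_magic(seed, [(5, b"\xd8"), (6, b"\xff")]))
```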