Fuzzing Seminar Notes
Notes of papers I surveyed to prepare my seminar homework paper.
Böhme - Fuzzing: Challenges and Reflections
- mutational: seed + mutation
- generational: generation from grammar
box taxonomy
- black:
  input generation with no knowledge of the program.
  Peach
- grey:
  a bit of instrumentation at compile time.
  AFL, LibFuzzer, HonggFuzz
- white:
  mutates path conditions, not inputs.
  KLEE, SAGE
Enablers of recent Fuzzing surge?
- need: software eating the world
- incentives: bug bounty etc.
- tools: effective, easy to use, OSS fuzzers
Challenges
- More Software: How to fuzz cyber-physical systems, ML systems, stateful
software, polyglot software, GUI-based programs?
- More Bug types: How to identify information leaks, side-channels,
privilege escalation, RCE?
- More Difficult / "deep" Bugs
- More Empirical Studies: What's nature and distribution of vulns in code?
- Human in the Loop
- Usability
- How to assess residual risk after fuzzing?
- theoretical limits of fuzzing?
- How to evaluate special (GUI, ...) fuzzers? overfitting? time budget?
How to evaluate techniques, not implementations?
- Are synthetic / previously discovered bugs representative?
Li - Fuzzing: A Survey
:(
Godefroid - Fuzzing: Hack, Art, and Science
approx three main ways to detect vulns
- static program analysis: fast, shallow, false positives
- manual code inspection: labor-intensive, does not scale
- fuzzing
Whitebox: dynamic SymEx / Concolic Execution
SAGE
Miller - The Relevance of Classic Fuzz Testing: Have We Solved This One?
Rehash of the initial 1990 fuzz approach. Still finds bugs.
Liang - Fuzzing: State of the Art
Fortify binary to hinder fuzzers
Evaluated with libjpeg, pcre2 etc.
- Coverage down ~70% on avg., though sometimes great, sometimes bad
- Found Bugs down between 60% (LAVA-M dataset) to 90% (binutils v2.3)
- binutils overhead File size: ~60%, CPU: ~4%. GUI app: 5%, 1%
How does this work with actually released software (i.e. v1.1, v1.2, ...)?
Is each version fuzzified similarly, or does that leak info?
Valgrind wasn't adopted by Google devs due to 20x slowdown.
Sanitizers each have only ~2x slowdown, thus every Chromium commit can be run against all unit tests with sanitizers enabled.
libFuzzer
very easy to use, finds heartbleed in a few seconds.
Hardening: Check values of fn ptrs so only expected fns are hit; keep a shadow
stack (stored somewhere else) so that the return address can't be overwritten.
Previous Work
- Mutation-based: Combine AST subtrees from seeds. LangFuzz, IFuzzer, GramFuzz
- Generation-based: Apply JS grammar rules. jsfunfuzz
Contribution
- Dataset: JS engine regression tests, PoCs of CVEs. ~30K JS files
- Trains LSTM Model to generate new code
- Fuzzes ChakraCore (IE)
- Finds (a bit) more Bugs than CodeAlchemist, jsfunfuzz in 72h, found some new bugs.
- Static Analysis: identify schedule-relevant code parts, probabilistically
instrument, increase schedule diversity
- Seed selection that prioritizes new "regular" schedules and new interleavings (?)
- Run with TSan
- Background: SmartTVs run heavily customized AOSP
- Infer input validation from log msgs from Android ROM
- mutate according to logs
[Overview](ampfuss overview.png)
- Looks for UDP requests that excite large responses
- Server fuzzing difficult/different:
- Timing sensitive socket input: too early (server still starting), too
late (time wasted). Fix by notifying fuzzer.
- non-terminating handle-loop: Transform into one-shot program by
  identifying sources, sinks
- Difficult to fuzz over-the-air: Closed-source impl, timing constraints.
Current Approach either manual input generation or too many invalid inputs
generated by mutation.
- Approaches for BT fuzzing
- Use Bluetooth Dongle: cannot fuzz low-lvl stuff
- Emulation: well, you need emulator
- Their approach: Use special Bluetooth hardware (to send more or less
  arbitrary data?), be very smart about state.
- Coverage-guided fuzzing timeline
- Bunny-the-fuzzer (2007)
- SAGE 2008
- AFL
- Heartbleed
- 2011-12-31: Introduced into OpenSSL
- 2014-03: Found by audit and, independently, specialized fuzzer
- 2015-04-07: Hanno Böck: AFL finds in 6 hours
- 2015-04-09: Serebryany: libFuzzer in 10 seconds
- Problem: Fuzzing done by security researchers, not by code owners
- libFuzzer features
- fuzzing corpus: previously recorded relevant inputs
- real inputs for mutations
- fuzzing dict: important strings
- timeouts are problematic when fuzzing: can't reach further
- Continuous Fuzzing / Fuzz-Driven Development
- APIs as Fuzz Target
- Tests as Seed Corpus
- CI includes Continuous Fuzzing
- Summary
- Coverage-guided fuzzing is easy
- Fuzzing must be continuous, automated, maintained by code owners
- Fuzzing vs Concolic Execution
- Fuzzing: Good at general inputs. Problem: specific inputs
- ConcExe: Good at specific inputs. Problem: state explosion
- Idea: Hybrid fuzzing: Only fork at hard-to-fuzz branches.
E.g. Driller.
Hybrid fuzzing for "real world software" (Driller rather CGC specific?)
Note: Requires Developer hints
- Traditional obfuscation useless against fuzzing.
- "Dragnet-style fuzzing": automatically fuzz the top 500 AUR packages
(cf Fuzzing Ex Machina)
- Fuzzing History:
- First Fuzzing Paper (?) by Duran and Ntafos
- Blind / Black-box Fuzzers
- mutational: ZZUF, RADAMSA
- AFL: efficient coverage info
- Coverage-Info Handling:
- AFL: Uses a bitmap to approximate BB edges.
  Angora: additionally XORs the call stack into the edge (context-sensitive).
- Vuzzer: Exact BB coverage, BB weighting depending on how "deep" in
control flow. Weights -> Fitness functions -> evolutionary algo for
mutation fuzzing.
- Hybrid Fuzzing: Concolic Exe to reach improbable paths.
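The AFL bitmap scheme above can be sketched like this (a minimal Python model of what AFL does in C; the block IDs and map size here are illustrative, the XOR-with-shifted-predecessor trick is AFL's):

```python
# Minimal model of AFL-style edge coverage. AFL instruments each basic block
# with a random ID and records the edge A->B roughly as
# shared_mem[cur ^ (prev >> 1)]++, trading exactness for a cheap fixed-size map.

MAP_SIZE = 1 << 16  # AFL's default bitmap size

def new_trace():
    return bytearray(MAP_SIZE)

def record_edge(bitmap, prev_id, cur_id):
    idx = (cur_id ^ (prev_id >> 1)) % MAP_SIZE
    bitmap[idx] = (bitmap[idx] + 1) & 0xFF  # 8-bit hit counter, wraps like AFL's
    return cur_id  # caller keeps this as prev_id for the next block

# A path A->B->C plus the edge A->C hits three distinct slots (no collision here):
bm = new_trace()
A, B, C = 0x1234, 0xBEEF, 0x0042
prev = record_edge(bm, A, B)
prev = record_edge(bm, prev, C)
record_edge(bm, A, C)
print(sum(1 for x in bm if x))  # 3
```

The shift is what makes A->B distinguishable from B->A; hash collisions are accepted as the price of a fixed-size map.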
- Central Fuzzing Assumptions
- Coverage relevant
- Crashes can be detected
- Many executions per second
- Constraints are solvable
- Attacks
- coverage -> irrelevant ("fake") code
- crash detection -> custom signal handler with exit 0
- many executions -> slowdown for weird input
- constraints solvable -> hash comparisons
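The "constraints solvable" attack is easy to demonstrate (my own toy example, not from the paper): routing a magic-byte compare through a hash removes both coverage feedback on partial matches and any hope of solver inversion.

```python
# Toy demonstration: a plain magic compare can leak progress via coverage
# (byte-wise compares flip branches one byte at a time in C code), while
# hashing first yields a single opaque compare that neither mutation feedback
# nor a symbolic solver can work through.

import hashlib

MAGIC = b"FUZZ"
MAGIC_HASH = hashlib.sha256(MAGIC).digest()

def check_plain(data: bytes) -> bool:
    return data[:4] == MAGIC          # fuzzer-friendly

def check_hardened(data: bytes) -> bool:
    return hashlib.sha256(data[:4]).digest() == MAGIC_HASH  # solver-hostile

print(check_plain(b"FUZZing"), check_hardened(b"FUZZing"))  # True True
print(check_hardened(b"FUZA"))                              # False
```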
- uses existing cov-guided fuzzer (AFL, honggfuzz) to generate inputs
- When fuzzer gets "stuck": finds and removes non-critical sanity checks
- crash analyzer: Check that input also crashes real program by collecting
path constraints and ensure that they are satisfiable.
Filters out false positives with 6~30% false negatives
- Based on AFL.
- Aims to balance Exploration/Exploitation of seed discovery.
- Improved Search/Scheduling Algorithm.
- Problems with AFL, libFuzzer: low coverage, manual, not scalable
- Use lib and its consumers to infer API dependence graph (which functions to
call in which order), then generate libFuzzer stub.
- Eval against manually written fuzzers:
- a bit more code cov (~55% for FuzzGen vs ~48% for manual)
- fewer bugs found (17 vs 29): "Manual fuzzers test more thoroughly 'buggy' parts"
On avg. 5x speedup
Predicts reachability of program inputs without execution.
- Problem: sanity-checking branches
- Solution: hybrid fuzzers, which have their own problems: constraint solving,
  path explosion, environment modeling.
- RQ1: lightweight and accurate taint analysis?
- RQ2: How to efficiently guide mutation with taint?
- RQ3: How to tune the fuzzer's evolution direction with data-flow features?
Based on Angora.
target branches that are instrumented by sanitizers
Maybe prune some ASan instrumentations by preferring profiling-wise "cold" or
complex code.
Use Data Flow Analysis to solve path conditions.
Fuzzed Contiki-NG (low-resource IoT OS) using 8 fuzzers over three years.
AFL: impressively easy to use; 55 "unique" crashes caused by one (shallow) bug
in one hour.
Suggestion: early fixing.
- RQ1: hybrid fuzzers > mutation-based fuzzers?
  - "Did not detect any clear superiority of hybrid fuzzers wrt ability to
    expose bugs compared to mutation-based fuzzers"
- RQ2: employed techniques to expose bugs fast?
  - MOpt, SymCC, QSYM stand out in terms of finding bugs fast
- RQ3: Are there fuzzers that expose bugs every time?
  - MOpt, SymCC, QSYM stand out in terms of finding bugs consistently
- RQ4: Do sanitizers pay off for their runtime overhead in a time-limited run?
  - ASan pays off for more difficult bugs
  - EffectiveSan generally pays off
- Challenge: Automating vuln discovery (Main)
- fuzz more systems?
- detect more vulns?
- find deep bugs?
- Empirical nature of undiscovered vulnerabilities?
- Challenge: Evaluation and Benchmarking / fair fuzzer benchmark
- How to evaluate specialised fuzzers?
- How to prevent overfitting? (Goodhart's Law)
- Continuous Eval à la FuzzBench
- What is a good performance measure?
- "Time to retire Lava & CGC, they are actively harmful"
- Eval techniques, not implementations?
- "LibFuzzer, AFL++, HonggFuzz went through major performance
improvements -- enabled by FuzzBench"
Case instead of time as x-axis.
For example, if I wanted to prototype a new mutation strategy for AFL, I
would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc.
I effectively have to make sure my mutator is at-or-better than existing AFL
mutator performance to use benchmarks like this.
- 10x faster than state-of-the-art
- found bugs in Perl, ErlangVM
- AFL's input generation+prune algorithm is a Markov chain
  - path: file that is generated from another file and retained by AFL since
    it discovered a new path
  - energy: how often the file is mutated and executed
  - probability p_ij: prob. to find interesting file j by mutating i
- AFL gives every input from the queue the same energy, i.e. fuzzes that
  input 80k times.
  - might be too much/too little energy to find new paths.
  AFLFast instead exponentially increases energy.
  - focuses execution on hot paths: half of the inputs in a campaign of nm
    exercise three paths.
  AFLFast assigns energy inversely proportional to "hotness".
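A minimal sketch of an AFLFast-style exponential power schedule (my notation, not the paper's; the base and cap values are invented):

```python
# Toy AFLFast-style power schedule. Energy doubles each time a seed is
# re-chosen and is divided by how often the seed's path was exercised,
# so seeds on cold paths accumulate fuzzing effort; a cap bounds the budget.

def energy(base, times_chosen, path_frequency, cap=16000):
    return min(base * 2 ** times_chosen // path_frequency, cap)

# Cold-path seed (path seen once) vs hot-path seed (path seen 1000 times):
print(energy(100, 4, 1))     # 1600
print(energy(100, 4, 1000))  # 1
```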
- Search strategy:
  - AFL: FIFO
  - AFLFast: "most progressive first": priority on seeds that either exercise
    "colder" paths or have been chosen less often. Also don't over-prioritize
    a seed.
- Eval (24-hour runs):
  - finds 6 CVEs in nm seven times faster.
  - finds an order of magnitude more crashing inputs in cxxfilt, nm, objdump
Discussion of Böhme and Zalewski
weaker precursor: Dowser
builds on top of AFLFast
- Use cases for directed fuzzing: patch testing, crash reproduction, etc.
- SymEx can easily be directed (constraint satisfaction problem).
- To direct AFL, it has to be viewed as optimisation problem.
- At instrumentation time:
- Extract call graph, control flow graph
- For each basic block (BB), compute (approx) distance to target
- Instrument program to aggregate distance values
- At runtime:
- Collect coverage, distance info
- Fuzz an input proportional to its distance to target
- Implementation: two uint64 values in AFL's shared_mem map for each BB:
  - Aggregated BB-level distance value
  - Number of executed BBs
- Direction by influencing a seed's energy (cf. AFLFast):
  More energy (no. of mutations+execs) to seeds that are closer to the target.
  Not gradient descent (problem: local minima) but simulated annealing
  (first exploration, then exploitation).
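The annealing-based schedule can be sketched as follows (heavily simplified: AFLGo cools exponentially, I use linear cooling, and the weight formula is my own interpolation; norm_distance in [0,1] is the seed's normalized distance to the target):

```python
# Simplified AFLGo-style annealing power schedule (assumptions noted above).

def annealed_weight(norm_distance, t, t_exploit):
    temp = max(0.0, 1.0 - t / t_exploit)  # 1.0 at start, 0.0 from t_exploit on
    # high temp: distance ignored (exploration); low temp: distance dominates
    return (1.0 - norm_distance) * (1.0 - temp) + 0.5 * temp

print(annealed_weight(0.1, 0, 60))             # 0.5: start, distance irrelevant
print(round(annealed_weight(0.1, 60, 60), 2))  # 0.9: close seed favored
print(round(annealed_weight(0.9, 60, 60), 2))  # 0.1: far seed starved
```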
- Eval:
- redo KATCH (KLEE based patch testing) benchmark
- 175 diffutils patches, 181 binutils patches
- reaches more destinations than KATCH(!) (also much faster obv.)
- redo BugRedux (KLEE based crash repro) benchmark
- 7 crashes in
sed
, grep
, gzip
, ...
- Able to reproduce 6 of 7 crashes, BugRedux only did 2 out of 7.
Hypervisor + Intel CPU Extensions for near-native performance for kernel
fuzzing
Intel Processor Trace: traces taken/not-taken branch decisions and target addresses of jumps.
Fuzzes kernel in virtual machine.
Fuzzer communicates with agent that sits below/beneath (?) kernel in vm.
Fuzzer receives coverage info from Intel PT Driver.
Uni Bochum
builds on top of kAFL
- Roadblocks:
- Magic Numbers: if input == $magic then 1 / 0 else 1;
  the crash is rarely reached since guessing $magic is unlikely
- nested checksums
- Idea: Input-to-State Correspondence
- determine cmp instructions, place hooks there
- When hit, record actual and expected value, e.g. act: 0xcafe, exp: 0x1234.
  replace(0xcafe, 0x1234) in the next fuzz input
- Binary instrumentation, thus 10x slowdown, thus only done for new queue
entries
- Problem: inputs with e.g. many 0s, and replace(0, ...).
  Solution: find a more random input that executes the same trace.
- Patch out checksum stuff etc (as in T-Fuzz).
To weed out false positives, don't use SymEx after the fuzz campaign
(since queue still full of false positives) but input-state-corresp.
during the campaign.
- Handle nested checksums by topo sort and then fixing in the right order.
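The core replace step is tiny (a sketch; real REDQUEEN also tries different encodings, endianness, and arithmetic offsets of the observed value):

```python
# Sketch of the input-to-state replace step: substitute the expected cmp
# operand wherever the observed operand occurs verbatim in the fuzz input.

def i2s_replace(fuzz_input: bytes, act: bytes, exp: bytes):
    """Yield one variant per occurrence of the observed operand `act`,
    with the expected operand `exp` substituted at that position."""
    start = 0
    while True:
        pos = fuzz_input.find(act, start)
        if pos < 0:
            return
        yield fuzz_input[:pos] + exp + fuzz_input[pos + len(act):]
        start = pos + 1

# cmp observed act=0xcafe, exp=0x1234 in a seed containing 0xcafe once:
for v in i2s_replace(b"AA\xca\xfeBB", b"\xca\xfe", b"\x12\x34"):
    print(v)  # b'AA\x124BB'
```

With an input full of 0s and act = 0, nearly every position matches, which is exactly the degenerate case noted above that the more-random-input trick fixes.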
- Eval:
- All binutils that read (not write) a single input file
- outperforms kAFL, Laf-Intel by 2x
- Slower than the others since kAFL reaches longer paths
- Speaker's thoughts on eval:
- LAVA-M is completely solved
- Next-gen dataset: "large number of real world programs with coverage"
since afaiu everything that's hard for a fuzzer should reflect in
coverage (e.g. hard to reach constraints)
Driller: Augmenting Fuzzing Through Selective Symbolic Execution
Combines AFL and angr (concolic execution for binary;
paper,
website).
core intuition: input can be divided into general and specific parts, with
many resp. few accepted values. Checks for specific input divide the app into
compartments.
- Fuzzes with AFL until it deems itself stuck.
- Invokes concolic execution engine, passing all interesting inputs from
fuzzing.
- Concolic execution determines inputs that reach new paths/compartments.
"smart" without symex
- AFL struggles with magic bytes, e.g. if buf[5] == 0xD8 && buf[6] == 0xFF
- does magic byte detection by extracting cmp immediate operands
- does dynamic taint analysis to correspond input bytes to cmp operands
  on first execution of a seed
- Also (de)prioritizes basic blocks
(+ if deeply nested, - if determined to be error handling)
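The resulting mutation step can be sketched like this (names and structure mine; the cmp_sites list stands in for what the tool derives from extracted immediates plus the taint pass):

```python
# Toy version of magic-byte patching: write each cmp immediate to the input
# offset that taint analysis says flows into that cmp, instead of hoping
# random mutation guesses e.g. buf[5] == 0xD8 && buf[6] == 0xFF.

def patch_magic(seed: bytearray, cmp_sites):
    """cmp_sites: list of (input_offset, immediate_bytes) pairs."""
    out = bytearray(seed)
    for off, imm in cmp_sites:
        out[off:off + len(imm)] = imm  # place the immediate where taint says it flows
    return bytes(out)

# Satisfy the two byte checks directly:
seed = bytearray(b"\x00" * 8)
print(patch_magic(seed, [(5, b"\xd8"), (6, b"\xff")]))
```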