In this assignment you are provided with the skeleton of a random and a coverage-guided fuzz testing tool for
small Python programs. The assignment has a programming component, in which you will implement the core
heuristics of the fuzz testing tools, and an experimental component, in which you will evaluate the tools you
have built on some small benchmark programs. For the experimental component, it is recommended that you
run each designated technique on each benchmark 3 times, for 1 minute each. This means a total of about 20
minutes of experiment runtime if all experiments are run sequentially.
Learning Goals
- Construct heuristics for the core of a random and a coverage-guided fuzzer
- Identify when coverage-guided fuzzing might yield better performance than random fuzzing and vice-versa
- Analyze the effect of input seeds on the performance of coverage-guided fuzzing
- Compare the performance of structured fuzzing with different levels of structure specification
Questions
Ask questions, report potential typos, etc., about the assignment on the public class
Piazza as much as possible so other students can benefit from the clarifications.
Background
In class we will read two papers which cover the basics of random, coverage-guided, and generator-based
fuzzing. The September 14th lecture will also cover these topics. The following recaps the core background.
Random Fuzzing
See also the description of "fuzz" in the original fuzzing paper. A random fuzzer simply produces totally
random inputs and sends these to the program under test for the duration of the fuzzing session.
The pseudo-code for random fuzzing is as follows:
def random_fuzz(program_to_test, max_search_time):
    start_time = current_time()
    while (current_time() - start_time < max_search_time):
        new_input = generate_random_input()
        has_error, input_coverage = execute_on_input(program_to_test, new_input)
        if (has_error):
            save_as_error(new_input)
It is not necessary to keep track of coverage achieved while performing random fuzzing. Our implementation
does so anyway so that you can compare its coverage to that of coverage-guided fuzzing (see the discussion in
"Bonus Exploration").
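As a reference point, the generate_random_input used in the pseudo-code above could be sketched as follows. This is only an illustrative sketch, not the required implementation of the fuzzer's generate_input method; in particular, the MAX_LEN bound is an arbitrary choice.

```python
import random

MAX_LEN = 64  # arbitrary upper bound on input length, chosen for illustration

def generate_random_input() -> bytes:
    """One possible generate_random_input: a random-length string of random bytes."""
    length = random.randint(0, MAX_LEN)
    return bytes(random.randint(0, 255) for _ in range(length))
```

Any distribution over bytestrings works here; the uniform choice above is simply the least opinionated starting point.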
Coverage-Guided Fuzzing
See also Section 2.2 of the Zest paper.
The core algorithm of coverage-guided fuzzing you will be implementing is the following:
def coverage_guided_fuzz(program_under_test, seed_inputs, max_search_time):
    start_time = current_time()
    saved_inputs = set(seed_inputs)
    while (current_time() - start_time < max_search_time):
        for saved_input in saved_inputs:
            if random_float() < FUZZ_PROB(saved_input):
                num_mutants = NUM_MUTANTS(saved_input)
                for _ in range(0, num_mutants):
                    mutated_input = MUTATE_INPUT(saved_input)
                    has_error, input_coverage = execute_on_input(
                        program_under_test, mutated_input
                    )
                    if (has_error):
                        save_as_error(mutated_input)
                    elif (has_new_coverage(input_coverage)):
                        saved_inputs.add(mutated_input)
The real implementation in fuzzer.py includes some additional statements for statistics-tracking
purposes, as well as a loop over saved_inputs that is safe under expansion of
saved_inputs during iteration.
In a nutshell, the algorithm starts from a set of seed inputs (or some arbitrarily-generated input if no
seed input is provided). These populate the set saved_inputs. Coverage-guided fuzzing produces
new inputs by repeatedly mutating the inputs in saved_inputs. Any new mutated input which is
found to increase code coverage is added to saved_inputs, and the process repeats.
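For concreteness, the has_new_coverage check in the pseudo-code can be sketched as a comparison against a global set of branches seen so far. This is illustrative only: the real fuzzer.py tracks branch coverage through the coverage.py library, and the tuple representation of a branch below is an assumption, not the actual data format.

```python
# Sketch of has_new_coverage (illustrative; branches are represented here as
# arbitrary hashable items, e.g. hypothetical (filename, from_line, to_line) tuples).
covered_branches = set()  # all branches observed so far in the session

def has_new_coverage(input_coverage):
    """Return True iff input_coverage contains a branch not seen before."""
    global covered_branches
    new_branches = set(input_coverage) - covered_branches
    covered_branches |= new_branches
    return len(new_branches) > 0
```

Note that the check is stateful: an input is only "interesting" relative to everything the session has already covered, which is why the same input can be new once and uninteresting forever after.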
Three core heuristics control the performance of coverage-guided fuzzing:
- We may not always want to mutate all the inputs in saved_inputs, especially as the size of
saved_inputs grows. FUZZ_PROB, aka the fuzz_prob method in
CoverageGuidedFuzzer, controls the likelihood of selecting an input in
saved_inputs for mutation.
- Mutating and running mutated inputs is time-consuming, so we may not want to produce many mutated
inputs from each input selected for mutation. NUM_MUTANTS, aka the num_mutants
method in CoverageGuidedFuzzer, controls how many mutants we will generate from a given
parent input.
- Above we refer to the process of mutating inputs, but how exactly should this be done? Small mutations
allow targeted exploration near a particular input; larger mutations may allow faster exploration of
different behaviors. MUTATE_INPUT, aka the mutate_input method in
CoverageGuidedFuzzer, defines how to mutate a particular input. This method should certainly
be non-deterministic.
In practice, other aspects of the algorithm, such as deciding when to save an input, can also affect the
performance of fuzzing. For the purposes of this assignment, we will consider these fixed.
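As one concrete illustration of MUTATE_INPUT, the sketch below applies a single random byte-level edit (overwrite, insert, or delete) at a random position. This is just one plausible design, not a prescribed implementation; real-world mutators such as AFL's typically stack many such edits per call.

```python
import random

def mutate_input(input_data: bytes) -> bytes:
    """One possible MUTATE_INPUT: apply one random byte-level edit.

    Illustrative sketch only; the probabilities (0.4 / 0.3 / 0.3) are arbitrary.
    """
    data = bytearray(input_data)
    choice = random.random()
    if choice < 0.4 and data:      # overwrite a byte with a random value
        data[random.randrange(len(data))] = random.randint(0, 255)
    elif choice < 0.7:             # insert a random byte at a random position
        data.insert(random.randint(0, len(data)), random.randint(0, 255))
    elif data:                     # delete a random byte
        del data[random.randrange(len(data))]
    else:                          # empty input: nothing to delete, so insert
        data.append(random.randint(0, 255))
    return bytes(data)
```

Because each call draws fresh randomness, repeated calls on the same parent yield different mutants, satisfying the non-determinism requirement above.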
Generator-Based Fuzzing
See also Section 2.1 of the Zest paper.
At its conceptual core, generator-based fuzzing is simply random fuzzing, where, rather than generating a
random input from scratch, a provided input generator is called to generate inputs:
def generator_fuzz(program_to_test, max_search_time, input_generator):
    start_time = current_time()
    while (current_time() - start_time < max_search_time):
        new_input = input_generator()
        has_error, input_coverage = execute_on_input(program_to_test, new_input)
        if (has_error):
            save_as_error(new_input)
You will not be implementing a generator-based fuzzer; the benchmarks
cgi_decode_good_generator.py and cgi_decode_ok_generator.py build the generator
into the test driver. For instance, the test_one_input function (i.e., the program under test)
defined in
cgi_decode.py is:
def test_one_input(input_data: bytes):
    try:
        input_str = input_data.decode("UTF-8")
        cgi_decode(input_str)
    except ValueError:
        # Invalid input, but not a bug
        pass
While in cgi_decode_ok_generator.py it is:
def test_one_input(input_data: bytes):
    try:
        input_str = generate_cgi_string(input_data)
        cgi_decode(input_str)
    except ValueError:
        # Invalid input, but not a bug
        pass
The generator generate_cgi_string uses the provided sequence of bytes to build the input,
rather than polling a random source to make decisions about input generation. Refer to the implementation
in generators/cgi_string.py for how exactly this works. The Zest paper refers to such
generators as parametric generators.
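To make the idea of a parametric generator concrete, here is a toy sketch. It is NOT the actual code in generators/cgi_string.py; the function name and alphabet are hypothetical. The key property it illustrates is that every "random" decision is driven by the fuzzer-provided bytes.

```python
# Toy parametric generator (illustrative only): each fuzzer byte selects one
# character from a hypothetical alphabet of CGI-legal characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789+%"

def toy_generate_cgi_string(input_data: bytes) -> str:
    """Map each byte of the fuzzer's input to one character choice."""
    return "".join(ALPHABET[b % len(ALPHABET)] for b in input_data)
```

Because the output is a deterministic function of input_data, mutating the underlying bytes mutates the generated string, which is exactly what lets coverage-guided fuzzing steer a generator.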
Code Coverage
Line coverage refers to the lines executed at least once by the program under test. Branch
coverage refers to whether both "sides" of an if statement (or while
statement, or switch statement...) are covered.
For example, consider the program foo here:
def foo(x, y):
    z = 2 * x
    if z > y:
        z = y
    return z + y
- The call foo(3,2) gets full line coverage of the program: all lines are executed. It does
not get full branch coverage, since only the true side of if z > y: is executed.
- The call foo(3,7) misses one line (z = y), so it does not get full line coverage.
It does not get full branch coverage, since only the false side of if z > y: is executed.
- Together, the calls foo(3,2) and foo(3,7) achieve full line + branch
coverage of the program.
In this assignment, we use the coverage.py library
to measure coverage. The fuzzer only considers branch (aka "edge" or "arc") coverage as "new coverage", but
the produced html coverage reports also show you the line coverage of the program under test.
In the coverage.py html report, covered lines are highlighted green; uncovered lines are highlighted red.
Lines containing a branch which is not fully covered (i.e., only the true or the false case is covered, not
both) are highlighted yellow.
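To see line coverage in action without any external tooling, the following self-contained sketch uses Python's built-in sys.settrace hook to record which lines of foo each call executes. This is only an illustration of the concept; the assignment's fuzzer measures coverage with coverage.py, not settrace.

```python
import sys

def foo(x, y):
    z = 2 * x
    if z > y:
        z = y
    return z + y

def trace_lines(func, *args):
    """Run func(*args); return executed line numbers, relative to the def line."""
    executed = set()

    def tracer(frame, event, arg):
        # Record only 'line' events inside func's own code object.
        if event == "line" and frame.f_code is func.__code__:
            executed.add(frame.f_lineno - func.__code__.co_firstlineno)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return executed

# foo(3, 2) takes the true branch and executes every body line;
# foo(3, 7) skips `z = y` (relative line 3), so its line coverage is partial.
```

Unioning trace_lines(foo, 3, 2) and trace_lines(foo, 3, 7) reproduces the observation above that the two calls together achieve full line coverage.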
Instructions
Collaboration policy
You may discuss the assignment with other students but all coding and the writeup should be done on your
own.
Startup
You will need a machine that runs Python 3.8+. One external package, coverage.py, is needed to run the
fuzzing tool. You should be able to install this with pip (either
pip install coverage or pip3 install coverage depending on your system); see
instructions here.
Download the startup code here: cpsc539l-assignment-materials.zip. After unpacking the archive, you will be
left with a directory with the following structure:
- fuzzer.py : this is the file in which you will complete your implementation tasks
- benchmarks : this folder contains several benchmarks on which you can run your fuzzer
  - quicksort.py : an implementation of quicksort; you can use this for quick test runs
  - cgi_decode.py : a decoder of CGI-encoded strings, which accepts the
    inputs directly generated by the fuzzer.
  - cgi_decode_ok_generator.py : a decoder of CGI-encoded strings, where the fuzzer-generated
    bytes are used to generate a string with valid CGI characters, which is then sent to the
    decoder. This is effectively a generator-based fuzzer with a very simple generator.
  - cgi_decode_good_generator.py : a decoder of CGI-encoded strings, where the
    fuzzer-generated bytes are used to generate a valid CGI string, which is then sent to the
    decoder. This is effectively a generator-based fuzzer with a generator which only creates valid inputs
    for the CGI decoder.
- generators : this folder contains generators which convert a fuzzer-generated bytestring
  into more structured data
  - cgi_str.py : contains the generator for valid CGI strings (used in the
    "good" benchmark), and the generator for strings
    with valid CGI characters (used in the "ok" benchmark).
- seeds : this folder contains an input file to seed coverage-guided fuzzing with on the
  cgi_decode benchmark
The core code of the cgi_decode benchmark is taken from the Fuzzing Book.
Programming Component
You have four functions to fill in. After filling these functions in you should be able to run (a)
your random fuzzer and (b) your coverage guided fuzzer on the provided benchmarks.
- Random fuzzer: fill in the generate_input method in the RandomFuzzer class in
order to complete the random fuzzer.
- Coverage-guided fuzzer: there are 3 methods to fill in in the CoverageGuidedFuzzer class:
  - fuzz_prob(input) : given a saved input, should we use it as a parent for
    mutation? If fuzz_prob returns 1, we will always use it as a parent for mutation; if it
    returns 0, it will never be used as a parent for mutation.
  - num_mutants(input) : given a saved input selected for mutation, how many
    "children" (or "mutant") inputs should we create from it?
  - mutate_input(input_data) : given a byte-string input, create a new byte-string input
    from it. Note that this method should definitely be non-deterministic, i.e. not always return the same
    mutant.
You can test your random fuzzer on the quicksort benchmark as follows:
$ python3 fuzzer.py benchmarks.quicksort <OUTPUT_DIR_NAME> --fuzzer RANDOM --time 10
and you can test your coverage-guided fuzzer on the quicksort benchmark as follows:
$ python3 fuzzer.py benchmarks.quicksort <OUTPUT_DIR_NAME> --fuzzer COVERAGE --time 10
Describe the final implementation you chose for each of these four functions, and why you chose this implementation, in your assignment writeup.
Hints
- Our fuzzers produce Python bytestring inputs, rather than regular Python strings. You can construct a
Python bytes object from a list of integers lst , where each x in
the list satisfies 0 <= x <= 255 , with the function bytes() , i.e.:
bytes(lst) .
- You need not use all the fields in the SavedInput and Fuzzer classes in your
implementation; you may also track any additional information you find useful.
- There are no "right" implementations for any of these functions, especially those in
CoverageGuidedFuzzer . However, the performance of your coverage-guided fuzzer may vary
depending on your implementation. For instance, if you make fuzz_prob and
num_mutants output a constant value, the performance may be more similar to a random
fuzzer. Some heuristics you may consider:
  - Choosing to mutate (i.e., assign a higher fuzz_prob to) more recently-discovered inputs; inputs
    that have not yet been mutated; or inputs that cover more branches
  - Choosing to generate more children for fast-executing and short (in length) inputs.
- For more inspiration, you can check out the technical whitepaper for AFL, or
consider the approaches in different research papers.
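To make the hinted heuristics concrete, here is one possible, purely illustrative pair of implementations. The real fuzz_prob and num_mutants methods take a saved-input object; the sketch below assumes its discovery time and raw bytes have been extracted as parameters, which may not match the actual SavedInput interface, and every constant in it is an arbitrary choice.

```python
# Illustrative heuristic sketch, not a prescribed implementation.

def fuzz_prob(discovery_time: float, now: float) -> float:
    """Favor recently discovered inputs: selection probability decays with age.

    Both times are seconds since the fuzzing session started; the 60-second
    decay window and the 0.1 floor are arbitrary illustrative constants.
    """
    age = max(now - discovery_time, 0.0)
    return max(0.1, 1.0 - age / 60.0)  # floor keeps old inputs in rotation

def num_mutants(data: bytes) -> int:
    """Generate more children for short inputs, which are cheap to execute."""
    return max(1, 20 - len(data) // 8)
```

A constant-valued variant of either function is a useful baseline for the experimental component, since it makes the coverage-guided fuzzer behave more like the random one.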
Experimental Component
Now that you have a running random fuzzer and coverage-guided fuzzer, you will evaluate them on a few
benchmarks. For each benchmark and technique, run three 1-minute experiments. You can run them sequentially or
in parallel; make sure to specify a different output directory for each repetition.
- Run the random and coverage-guided fuzzer on the cgi_decode benchmark for 1 minute each.
- What is the average branch coverage achieved by each technique? How long does it take to achieve
this maximum coverage? Does this vary across runs?
- Does each technique discover the crash? How long does it take for each technique to discover the
crash on average? Does this vary across runs?
- Run the coverage-guided fuzzer on the cgi_decode benchmark, this time with the
input seeds from the seeds/cgi_decode directory, i.e.:
python3 fuzzer.py benchmarks.cgi_decode <OUTPUT_DIR> --input_dir seeds/cgi_decode --fuzzer COVERAGE --time 60
- What is the average branch coverage achieved by each technique? How long does it take to achieve
this maximum coverage? Does this vary across runs? How does it compare to the coverage-guided fuzzing
without seeds?
- Does the technique discover crashes? How long does it take to discover each crash on
average? How does this compare to coverage-guided fuzzing without seeds?
- Now run the random and coverage-guided fuzzer on the cgi_decode_ok_generator and
cgi_decode_good_generator benchmarks. These benchmarks do not pass the fuzzer-generated input
directly to the program under test. Instead, they use the fuzzer-generated inputs as a source of randomness
for an "ok" generator (more likely to generate valid CGI-format strings) and a "good" generator (only
generates valid CGI-format strings) of CGI-format strings.
- Does the performance (coverage, time to find crashes) differ very much between random and
coverage-guided fuzzing on these benchmarks? Why do you think the performance does/does not differ?
- How does the performance (coverage, time to find crashes) differ between the "ok" and "good"
generator benchmarks? Why do you think the performance does/does not differ?
- How does the performance (coverage, time to find crashes) of coverage-guided fuzzing on these
benchmarks compare to coverage-guided fuzzing on
cgi_decode without seeds (question 1)?
Why do you think the performance does/does not differ?
- How does the performance (coverage, time to find crashes) compare to
cgi_decode
with seeds (question 2)? Why do you think the performance does/does not differ?
Submission
Submit your assignment via email to clemieux@cs.ubc.ca. Your submission should be a .zip or
.tar.gz archive named [firstname]_[lastname]_CPSC539L_A1.zip (or equivalently,
[firstname]_[lastname]_CPSC539L_A1.tar.gz ).
The archive should contain the following:
- fuzzer.py : your completed fuzzer.py file
- A writeup file in .txt , .pdf , or .md format which:
  - Describes your implementation decisions in fuzzer.py . For each of
    generate_input , fuzz_prob , num_mutants , and mutate_input , answer: what implementation did you settle on and why?
  - Answers all questions from the experimental component description (1 (a), (b); 2 (a), (b); 3 (a), (b),
    (c), (d))
Bonus Exploration
You do not need to answer these questions to achieve full marks on the assignment.
- Write different implementations of fuzz_prob , num_mutants , and
mutate_input . Do you get different results than with your original implementation?
- What happens if mutate_input is not non-deterministic, i.e. always mutates the given input in
a certain way?
- Find an interesting self-contained Python program (i.e., most functionality in one file). Write a
test_one_input driver for it and try fuzzing it with the random and coverage-guided fuzzer.
- Our implementation keeps track of coverage even for the random fuzzer. In practice, keeping track of
coverage can reduce the speed of execution. Thus, an advantage of random fuzzing in many scenarios is that
more inputs can be evaluated than with coverage guidance. If it is relatively easy to find bugs with
random fuzzing, coverage-guided fuzzing will often be slower at finding initial inputs. If you modify the
code to remove coverage tracking (self.cov.start() and self.cov.stop() in
exec_with_coverage ), and save only failing inputs, can you find failing inputs faster with
random fuzzing than with coverage-guided fuzzing?
Note
While this is labelled Assignment 1, it is indeed the only assignment for the 2022W1 offering of the
course.