2022 W1 CPSC 539L: Assignment 1

CPSC 539L, Winter Term 1 2022, Instructor: Caroline Lemieux (clemieux@cs.ubc.ca)

Mon/Wed 10:30AM-12:00PM, DMP 101, UBC course page


In this assignment you are provided with the skeleton of a random and a coverage-guided fuzz testing tool for small Python programs. The assignment has a programming component, in which you will implement the core heuristics of the fuzz testing tools, and an experimental component, in which you will evaluate the tools you have built on some small benchmark programs. For the experimental component, it is recommended that you run each designated technique on each benchmark 3 times, for 1 minute each. This means a total of about 20 minutes of experiment runtime if all experiments are run sequentially.

Learning Goals

  • Construct heuristics for the core of a random and a coverage-guided fuzzer
  • Identify when coverage-guided fuzzing might yield better performance than random fuzzing and vice-versa
  • Analyze the effect of input seeds on the performance of coverage-guided fuzzing
  • Compare the performance of structured fuzzing with different levels of structure specification

Questions

Ask questions, report potential typos, etc., about the assignment on the public class Piazza as much as possible so other students can benefit from the clarifications.

Background

In class we will read two papers which cover the basics of random, coverage-guided, and generator-based fuzzing. The September 14th lecture will also cover these topics. The following recaps the core background.

Random Fuzzing

See also the description of "fuzz" in the original fuzzing paper. A random fuzzer simply produces totally random inputs and sends these to the program under test for the duration of the fuzzing session.

The pseudo-code for random fuzzing is as follows:

def random_fuzz(program_to_test, max_search_time):
    start_time = current_time()
    while (current_time() - start_time < max_search_time):
        new_input = generate_random_input()
        has_error, input_coverage = execute_on_input(program_to_test, new_input)
        if (has_error):
            save_as_error(new_input)
It is not necessary to keep track of coverage achieved while performing random fuzzing. Our implementation does so anyway so that you can compare its coverage to that of coverage-guided fuzzing (see the discussion in "Bonus Exploration").
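
For illustration, a generate_random_input along these lines might return a random-length bytestring. This is a hypothetical sketch: the name, signature, and length bound are placeholders, not the ones required in fuzzer.py.

import random

def generate_random_input(max_length=100):
    # Hypothetical sketch: pick a random length, then fill each
    # position with a uniformly random byte value (0-255).
    length = random.randint(0, max_length)
    return bytes(random.randint(0, 255) for _ in range(length))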

Coverage-Guided Fuzzing

See also Section 2.2 of the Zest paper. The core algorithm of coverage-guided fuzzing you will be implementing is the following:

def coverage_guided_fuzz(program_under_test, seed_inputs, max_search_time):
    start_time = current_time()
    saved_inputs = set(seed_inputs)
    while (current_time() - start_time < max_search_time):
        for saved_input in saved_inputs:
            if random_float() < FUZZ_PROB(saved_input):
                num_mutants = NUM_MUTANTS(saved_input)
                for _ in range(0, num_mutants):
                    mutated_input = MUTATE_INPUT(saved_input)
                    has_error, input_coverage = execute_on_input(
                        program_under_test, mutated_input
                    )
                    if (has_error):
                        save_as_error(mutated_input)
                    elif (has_new_coverage(input_coverage)):
                        saved_inputs.add(mutated_input)
 
The real implementation in fuzzer.py includes some additional statements for statistics tracking purposes, as well as a way of looping over saved_inputs that is safe even as saved_inputs expands (in Python, adding to a set while iterating over it raises a RuntimeError).

In a nutshell, the algorithm starts from a set of seed inputs (or some arbitrarily-generated input if no seed input is provided). These populate the set saved_inputs. Coverage-guided fuzzing produces new inputs by repeatedly mutating the inputs in saved_inputs. Any new mutated inputs which are found to increase code coverage are added to saved_inputs, and the process repeats.

Three core heuristics control the performance of coverage-guided fuzzing:

  • We may not always want to mutate all the inputs in saved_inputs, especially as the size of saved_inputs grows. FUZZ_PROB, aka the fuzz_prob method in CoverageGuidedFuzzer, controls the likelihood of selecting an input in saved_inputs for mutation.
  • Mutating + running mutated inputs is time-consuming. So we may not want to produce as many mutated inputs from each input selected for mutation. NUM_MUTANTS, aka the num_mutants method in CoverageGuidedFuzzer, controls how many mutants we will generate from a given parent input.
  • Above we refer to the process of mutating inputs, but how exactly should this be done? Small mutations allow targeted exploration near a particular input; larger mutations may allow for faster exploration of different behaviors. MUTATE_INPUT, aka the mutate_input method in CoverageGuidedFuzzer, defines how to mutate a particular input. This method should certainly be non-deterministic (one possible operator is sketched below).

In practice, other aspects of the algorithm, such as deciding when to save an input, can also affect the performance of fuzzing. For the purposes of this assignment, we will consider these fixed.
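
To make the mutation step concrete, here is a minimal sketch of a byte-level mutation operator. It is purely illustrative, not the required implementation; many other mutation strategies (inserting bytes, deleting bytes, splicing inputs) are equally reasonable.

import random

def mutate_input(input_data):
    # Illustrative sketch: copy the input, then overwrite a few
    # randomly chosen positions with fresh random byte values.
    mutated = bytearray(input_data)
    if not mutated:
        # Nothing to overwrite in an empty input: return one random byte.
        return bytes([random.randint(0, 255)])
    for _ in range(random.randint(1, 4)):
        pos = random.randrange(len(mutated))
        mutated[pos] = random.randint(0, 255)
    return bytes(mutated)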

Generator-Based Fuzzing

See also Section 2.1 of the Zest paper.

At its conceptual core, generator-based fuzzing is simply random fuzzing, where, rather than generating a random input from scratch, a provided input generator is called to generate inputs:

def generator_fuzz(program_to_test, max_search_time, input_generator):
    start_time = current_time()
    while (current_time() - start_time < max_search_time):
        new_input = input_generator()
        has_error, input_coverage = execute_on_input(program_to_test, new_input)
        if (has_error):
            save_as_error(new_input)

You will not be implementing a generator-based fuzzer; the benchmarks cgi_decode_good_generator.py and cgi_decode_ok_generator.py build the generator into the test driver. For instance, the test_one_input function (i.e., the program under test) defined in cgi_decode.py is:

def test_one_input(input_data: bytes):
    try:
        input_str = input_data.decode("UTF-8")
        cgi_decode(input_str)
    except ValueError:
        # Invalid input, but not a bug
        pass

While in cgi_decode_ok_generator.py it is:

def test_one_input(input_data: bytes):
    try:
        input_str = generate_cgi_string(input_data)
        cgi_decode(input_str)
    except ValueError:
        # Invalid input, but not a bug
        pass

The generator generate_cgi_string uses the provided sequence of bytes to build the input, rather than polling a random source to make decisions about input generation. Refer to the implementation in generators/cgi_string.py for how exactly this works. In the Zest paper, we refer to these generators as parametric generators.
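
For illustration only, a single parametric decision might look like the following hypothetical fragment (generate_percent_escape and its indexing scheme are made up; see the real code in the generators folder):

HEX_DIGITS = "0123456789abcdef"

def generate_percent_escape(input_data, index):
    # Hypothetical fragment: the next two fuzzer bytes deterministically
    # choose the two hex digits of a %-escape, so the same bytestring
    # always produces the same generated string.
    hi = HEX_DIGITS[input_data[index] % 16]
    lo = HEX_DIGITS[input_data[index + 1] % 16]
    return "%" + hi + lo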

Code Coverage

Line coverage refers to the lines executed at least once by the program under test. Branch coverage refers to whether both "sides" of an if statement (or while statement, or switch statement...) are covered.

For example, consider the program foo here:

def foo(x, y):
  z = 2 * x
  if z > y:
      z = y
  return z + y

  • The call foo(3,2) gets full line-coverage of the program: all lines are executed. It does not get full branch coverage, since only the true side of if z > y: is executed.
  • The call foo(3,7) misses one line (z = y) so does not get full line coverage. It does not get full branch coverage, since only the false side of if z > y: is executed.
  • Together, the calls to foo(3,2) and foo(3,7) achieve full line + branch coverage of the program.

In this assignment, we use the coverage.py library to measure coverage. The fuzzer only considers branch (aka "edge" or "arc") coverage as "new coverage", but the produced HTML coverage reports also show you the line coverage of the program under test. In the coverage.py HTML report, lines covered are highlighted green, and lines not covered are highlighted red. Lines containing a branch which is not fully covered (i.e., only the true or false case is covered, not both) are highlighted yellow.
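
If you want to generate such a report outside the fuzzer, the following is a minimal sketch using the public coverage.py API on the foo example above (the report directory name htmlcov is arbitrary):

import coverage

def foo(x, y):
    z = 2 * x
    if z > y:
        z = y
    return z + y

cov = coverage.Coverage(branch=True)  # enable branch ("arc") coverage
cov.start()
foo(3, 2)  # exercises only the true side of `if z > y:`
cov.stop()
cov.save()
cov.html_report(directory="htmlcov")  # the if line appears as a partial branch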

Instructions

Collaboration policy

You may discuss the assignment with other students but all coding and the writeup should be done on your own.

Startup

You will need a machine that runs Python 3.8+. One external package is needed to run the fuzzing tool: coverage.py. You should be able to install it with pip (either pip install coverage or pip3 install coverage, depending on your system); see instructions here.

Download the startup code here: cpsc539l-assignment-materials.zip. After unpacking the archive, you will be left with a directory with the following structure:

  • fuzzer.py: this is the file in which you will complete your implementation tasks
  • benchmarks: this folder contains several benchmarks on which you can run your fuzzer
    • quicksort.py: an implementation of quicksort; you can use this benchmark for quick test runs of your fuzzers
    • cgi_decode.py: a decoder of CGI-encoded strings, which accepts the inputs directly generated by the fuzzer.
    • cgi_decode_ok_generator.py: a decoder of CGI-encoded strings, where the fuzzer-generated bytes are used to generate a string with valid CGI characters, which is then sent to the decoder. This is effectively a generator-based fuzzer with a very simple generator.
    • cgi_decode_good_generator.py: a decoder of CGI-encoded strings, where the fuzzer-generated bytes are used to generate a valid CGI string, which is then sent to the decoder. This is effectively a generator-based fuzzer with a generator which only creates valid inputs for the CGI decoder.
  • generators: this folder contains generators which convert a fuzzer-generated bytestring into more structured data
    • cgi_str.py: contains the generator for valid CGI strings (used in the "good" benchmark), and the generator for strings with valid CGI characters (used in the "ok" benchmark).
  • seeds: this folder contains an input file to seed coverage-guided fuzzing with on the cgi_decode benchmark

The core code of the cgi_decode benchmark is taken from the Fuzzing Book.

Programming Component

You have four functions to fill in. After filling these functions in you should be able to run (a) your random fuzzer and (b) your coverage guided fuzzer on the provided benchmarks.

  1. Random fuzzer: fill in the generate_input method in the RandomFuzzer class in order to complete the random fuzzer.

  2. Coverage-guided fuzzer: there are 3 methods to fill in the CoverageGuidedFuzzer class:

    1. fuzz_prob(input): given a saved input, should we use it as a parent for mutation? If fuzz_prob returns 1, we will always use it as a parent for mutation; if it returns 0, it will not be used as a parent for mutation.
    2. num_mutants(input): given a saved input selected for mutation, how many "children" (or "mutant") inputs should we create from it?
    3. mutate_input(input_data): given a byte-string input, create a new byte-string input from it. Note that this method should definitely be non-deterministic, i.e. not always return the same mutant.

You can test your random fuzzer on the quicksort benchmark as follows:

 
  $ python3 fuzzer.py benchmarks.quicksort <OUTPUT_DIR_NAME> --fuzzer RANDOM --time 10
and you can test your coverage-guided fuzzer on the quicksort benchmark as follows:

  $ python3 fuzzer.py benchmarks.quicksort <OUTPUT_DIR_NAME> --fuzzer COVERAGE --time 10

Describe the final implementation you chose for each of these four functions, and why you chose this implementation, in your assignment writeup.

Hints

  • Our fuzzers produce Python bytestring inputs, rather than regular Python strings. You can construct a Python bytes object from a list of integers lst, where each x in the list satisfies 0 <= x <= 255, with the bytes() constructor, i.e.: bytes(lst).
  • You need not use all the fields in the SavedInput and Fuzzer classes in your implementation; you may also track any other additional information in your implementation.
  • There are no "right" implementations for any of these functions, especially those in CoverageGuidedFuzzer. However, the performance of your coverage-guided fuzzer may vary depending on your implementation. For instance, if you make fuzz_prob and num_mutants output a constant value, the performance may be more similar to a random fuzzer. Some heuristics you may consider (one possible shape is sketched after this list):
    • Choosing to mutate (have a higher fuzz_prob) more recently-discovered inputs; inputs that have not yet been mutated; inputs that cover more branches
    • Choosing to generate more children for fast-executing and short (in length) inputs.
  • For more inspiration, you can check out the technical whitepaper for AFL, or consider the approaches in different research papers.
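
As a sketch of what the first two heuristics above might look like in code, consider the following. The field names (times_fuzzed, data) and the constants are placeholders; check what SavedInput actually tracks in the starter code.

def fuzz_prob(saved_input, num_saved):
    # Placeholder heuristic: always mutate inputs that have never been
    # fuzzed; otherwise back off as the pool of saved inputs grows.
    if saved_input.times_fuzzed == 0:
        return 1.0
    return min(1.0, 10.0 / num_saved)

def num_mutants(saved_input):
    # Placeholder heuristic: shorter inputs are cheaper to execute,
    # so derive more children from them.
    return max(1, 64 // (len(saved_input.data) + 1))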

Experimental Component

Now that you have a running random fuzzer and coverage-guided fuzzer, you will evaluate them on a few benchmarks. For each benchmark and technique, run 3 1-minute experiments. You can run them sequentially or in parallel; make sure to specify different output directories for each repetition.

  1. Run the random and coverage-guided fuzzer on the cgi_decode benchmark for 1 minute each.

    1. What is the average branch coverage achieved by each technique? How long does it take to achieve this maximum coverage? Does this vary across runs?
    2. Does each technique discover the crash? How long does it take for each technique to discover the crash on average? Does this vary across runs?
  2. Run the coverage-guided fuzzer on the cgi_decode benchmark, this time with the input seeds from the seeds/cgi_decode directory, i.e.: python3 fuzzer.py benchmarks.cgi_decode <OUTPUT_DIR> --input_dir seeds/cgi_decode --fuzzer COVERAGE --time 60

    1. What is the average branch coverage achieved by the technique? How long does it take to achieve this maximum coverage? Does this vary across runs? How does it compare to coverage-guided fuzzing without seeds?
    2. Does the technique discover crashes? How long does it take to discover each crash on average? How does this compare to coverage-guided fuzzing without seeds?
  3. Now run the random and coverage-guided fuzzer on the cgi_decode_ok_generator and cgi_decode_good_generator benchmarks. These benchmarks do not pass the fuzzer-generated input directly to the program under test. Instead, they use the fuzzer-generated inputs as a source of randomness for an "ok" generator (more likely to generate valid CGI-format strings) and a "good" generator (only generates valid CGI-format strings) of CGI-format strings.

    1. Does the performance (coverage, time to find crashes) differ very much between random and coverage-guided fuzzing on these benchmarks? Why do you think the performance does/does not differ?
    2. How does the performance (coverage, time to find crashes) differ between the "ok" and "good" generator benchmarks? Why do you think the performance does/does not differ?
    3. How does the performance (coverage, time to find crashes) of coverage-guided fuzzing on these benchmarks compare to coverage-guided fuzzing on cgi_decode without seeds (question 1)? Why do you think the performance does/does not differ?
    4. How does the performance (coverage, time to find crashes) compare to cgi_decode with seeds (question 2)? Why do you think the performance does/does not differ?

Submission

Submit your assignment via email to clemieux@cs.ubc.ca. Your submission should be a .zip or .tar.gz archive named [firstname]_[lastname]_CPSC539L_A1.zip (or equivalently, [firstname]_[lastname]_CPSC539L_A1.tar.gz).

The archive should contain the following:

  • fuzzer.py: your completed fuzzer.py file
  • A writeup file in .txt, .pdf, or .md format which:
    • Describes your implementation decisions in fuzzer.py. For each of generate_input, fuzz_prob, num_mutants, mutate_input, answer: what implementation did you settle on and why?
    • Answers all questions from the experimental component description (1 (a), (b); 2 (a), (b); 3 (a), (b), (c), (d))

Bonus Exploration

You do not need to answer these questions to achieve full marks on the assignment.

  • Write different implementations of fuzz_prob, num_mutants, and mutate_input. Do you get different results than your original implementation?
  • What happens if mutate_input is deterministic, i.e., always mutates the given input in the same way?
  • Find an interesting self-contained python program (i.e., most functionality in one file). Write a test_one_input driver for it and try fuzzing it with the random and coverage-guided fuzzer.
  • Our implementation keeps track of coverage even for the random fuzzer. In practice, keeping track of coverage can reduce the speed of execution, so an advantage of random fuzzing in many scenarios is that more inputs can be evaluated than with coverage guidance. If it is relatively easy to find bugs with random fuzzing, coverage-guided fuzzing will often be slower at finding those initial failing inputs. If you modify the code to remove coverage tracking (self.cov.start() and self.cov.stop() in exec_with_coverage), and save only failing inputs, can you find failing inputs faster with random fuzzing than with coverage-guided fuzzing?

Note

While this is labelled Assignment 1, it is indeed the only assignment for the 2022W1 offering of the course.