Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis

Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis

CPSC 539L, Winter Term 1 2022, Instructor: Caroline Lemieux (clemieux@cs.ubc.ca)

Mon/Wed 10:30AM-12:00PM, DMP 101, UBC course page

Office hours by appointment, email Caroline


Course description

Software bugs remain pervasive in modern software systems. As software becomes increasingly intertwined in all aspects of our lives, the consequences of these bugs become increasingly severe. For instance, since the mid 2010s, several important security vulnerabilities have emerged from basic correctness bugs in software (e.g. Heartbleed, Cloudbleed, Shellshock ). Though not all bugs cause important security vulnerabilities, their overall cost remains high: one estimate puts the cost of dealing with software failures and defects in the United States alone in 2018 at nearly USD$1,540,000,000---the worldwide impact of these defects is even more severe.

While human-written tests are a prevailing method for assuring the quality of software in practice, they are still subject to human error, and allowing bugs to slip through the cracks that the human tester did not consider. This class will explore topics in the field of automated testing, which aims, at a high-level, to reduce the developer burden in discovering bugs and writing test cases. Recent advances in this field leverage static and dynamic information from the program under test in order to effectively explore the space of behaviors of the program under test, enabling the quick discovery of potentially severe security bugs.

The concerns of scalable bug detection and program debugging require innovation from multiple different fronts. As such, this course will cover papers from Software Engineering, Programming Languages, Security, and Systems venues.

Course-level learning goals

The course will provide an opportunity for participants to:
  • Become familiar with key techniques used in automated testing tools (including concolic execution, coverage-guided fuzz testing, search-based algorithms);
  • Critically evaluate research papers in the field of program testing and analysis;
  • Program automated testing tools in Python;
  • Design and conduct experimental evaluation of a program testing or analysis tool.

Format

This is a seminar-style class, which is primarily driven by in-class discussion of a research paper. Each student is expected to submit a response to the paper being discussed in the class 18 hours ahead of the class in which it is discussed. Each student will also serve as discussion lead for 1 paper; the discussion lead will read through other students' responses and prepare questions to start the discussion.

In addition to readings, there will be a programming assignment, which will give you the opportunity to implement and evaluate some of the test generation algorithms we will explore in papers.

The bulk of your course mark will come from an open-ended course project. The project can be conducted as part of a team. The project deliverables include an initial proposal, an in-class presentation of the work, and a final report summarizing what has been achieved.

Grading (Tentative)

  • Paper Responses: 20%
  • In-class Participation: 10%
  • Discussion Lead: 5%
  • Assignment: 12.5%
  • Project Proposal: 10%
  • Project Check-in: 2.5%
  • Project Presentation: 10%
  • Project Writeup: 30%

Schedule (Tentative)

Note: this schedule is still subject to change.

Wed Sep 7 Course Overview, Introductions

Optional Reading (do not write a response): How to Read a Paper, Srinivasan Keshav.
Sign up for course piazza: piazza.com/ubc.ca/winterterm12022/cpsc539l.

Link to slides from class.

Assignment Released.
DL: Caroline
Mon Sep 12 Random (a.k.a. Blackbox) Fuzzing

Barton P. Miller, Lars Fredriksen, Bryan So. An Empirical Study of the Reliability of UNIX Utilities. Communications of the ACM, 1990.

Optional reading (do not write a response, though may help give you more persepective on the first reading): Update for the 2020s: The Relevance of Classic Fuzz Testing: Have We Solved This One?
DL: Caroline
Wed Sep 14 Coverage-Guided (Greybox) Fuzzing; Generator-Based Fuzzing (Property-based testing).

Lecture; no response required
Optional reading, the seminal paper on "property-based testing": Koen Claessen, John Hughes. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP 2000.

Link to slides from class.

DL: Caroline
Mon Sep 19 Queen Elizabeth II's Funeral. No Class.
Wed Sep 21 Information-theoretic approaches to fuzzing.

Marcel Böhme, Valentin J.M. Manès, Sang Kil Cha. Boosting Fuzzer Efficiency: An Information Theoretic Perspective . ESEC/FSE 2020.

Combining Coverage-Guided and Generator-Based Fuzzing.

Rohan Padhye, Caroline Lemieux, Koushik Sen, Mike Papadakis, Yves Le Traon. Semantic Fuzzing with Zest, ISSTA 2019.

Assignment due Friday Sep 23 18:00 PT.
DL (Bohme et al): Zachary
DL (Padhye et al): Yayu
Mon Sep 26 Compiler Fuzzing

Xuejun Yang, Yang Chen, Eric Eide, John Regehr. Finding and Understanding Bugs in C Compilers, PLDI 2011.
DL: Udit
Wed Sep 28 Wrap up discussion on Zest

Intro to Concolic + Symbolic Execution

Time to form project teams, brainstorm

Link to slides from class.

DL: Caroline
Mon Oct 3 Concolic Execution

Koushik Sen, Darko Marinov, Gul Agha. CUTE: A Concolic Unit Testing Engine for C, ESEC/FSE 2005.
DL: Jingqian
Wed Oct 5 Symbolic Execution

Cristian Cadar, Daniel Dunbar, Dawson Engler. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs, OSDI 2008.
DL: Shizuko
Mon Oct 10 Thanksgiving Day. No class.
Wed Oct 12 Hybrid Fuzzing

Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, Taesoo Kim. QSYM : A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing , USENIX Security 2018.
Project Proposal due Friday Oct 14 18:00 PT.
DL:
Mon Oct 17 Evaluating Random Algorithms

Andrea Arcuri, Lionel Briand. A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms in Software Engineering, ICSE 2011.

DL: Marie
Wed Oct 19 Random Test Suite Generation

Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, Thomas Ball. Feedback-directed Random Test Generation. ICSE 2007.
DL: Joyce
Mon Oct 24 Evolutionary Test Suite Generation

Gordon Fraser, Andrea Arcuri . Evolutionary Generation of Whole Test Suites . International Conference On Quality Software (QSIC) 2011.

DL: Jifeng
Wed Oct 26 Intro to Taint Tracking

Edward J. Schwartz, Thanassis Avgerinos, David Brumley. All You Ever Wanted to Know About Dynamic Taint Analysis and Forward Symbolic Execution (but might have been afraid to ask). OAKLAND 2010.

DL: Caroline
Mon Oct 31 Taint Tracking

James Newsome, Dawn Song. Dynamic Taint Analysis for Automatic Detection, Analysis, and Signature Generation of Exploits on Commodity Software, NDSS 2005.
DL: ToTo
Wed Nov 2 Taint Tracking for Input Structure Inference

Matthias Höschele, Andreas Zeller. Mining Input Grammars from Dynamic Taints, ASE 2016 NIER.
DL: Kevin
Mon Nov 07 Foundations of Dataflow Analysis

Thomas Reps, Susan Horwitz, and Mooly Sagiv. Precise Interprocedural Dataflow Analysis via Graph Reachability, POPL 1995.
DL: Yanze
Wed Nov 09 Midterm break. No class.
Mon Nov 14 Checking Array Indexing with Dataflow Analysis

Brian Hackett, Manuvir Das, Daniel Wang, Zhe Yang. Modular Checking for Buffer Overflows in The Large, ICSE 2006.
DL: Madonna
Wed Nov 16 Checking Array Indexing with Dependent Types

Martin Kellogg, Vlastimil Dort, Suzanne Millstein, Michael D. Ernst. Lightweight Verification of Array Indexing, ISSTA 2018.

Project Check-in due Friday Nov 18 18:00 PT.
DL: Eric Conlon
Mon Nov 21 Delta Debugging

Andreas Zeller, Ralf Hilderbrandt. Simplifying and Isolating Failure-Inducing Input, IEEE TSE 2002.
DL: Tarcisio
Wed Nov 23 Bug Detection in Concurrent Programs

Robert O'Callahan, Jong-Deok Choi. Hybrid Dynamic Data Race Detection, PPoPP 2003.
DL: Rut
Mon Nov 28 Scalable Bug Detection in Concurrent Programs

Guangpu Li, Shuan Lu, Madanlal Musuthavi, Suman Nath, Rohan Padhye. Efficient Scalable Thread-Safety-Violation Detection, SOSP 2019.
DL: Praveen
Wed Nov 30 Scalable Bug Detection via Static Analysis + Symbolic Execution

Fraser Brown, Deian Stefan, Dawson Engler. Sys: a Static/Symbolic Tool for Finding Good Bugs in Good (Browser) Code, USENIX Security 2020.
DL: Xiangyu
Mon Dec 05 Project presentations
Wed Dec 07 Project presentations

Project Report due Friday December 16 18:00PM PT.

Paper Responses

Read each paper to the best of your ability before drafting your paper response. There are two goals to the paper responses:

  • Convey your understanding of the paper's core innovations
  • Convey your opinion of the paper, in order to spark in-class discussion
Structure your paper responses as:
  • A short paragraph summarizing the paper. What is the problem being tackled? How was it adressed by prior work? What are the innovation(s) proposed in this paper? How are those innovations evaluated?
  • Your opinion of the paper, which can be structured in a paragraph or bullet points. Some questions to guide you as you read and write your response:
    • What remains unclear after reading the paper? Are there any clarification questions whose answers would substantially change your opinion of the paper?
    • How does the paper's evaluation match with the proposed problem statement?
    • Do you forsee any barriers to the applicability of the technique proposed in the paper? If so, how could these barriers be overcome?
    • Which technical innovations are most compelling to you?
    • Which problems remain unsolved after this paper?
Remember, the responses help spark in-class discussion. Thus, specificity and depth in your responses is useful. A detailed description of a particular issue raised while reading the paper will be more useful than several high-level comments. For example, if you have doubts about the scalability of a particular technique, detail specific examples that raise those doubts, rather than simply saying "I think the technique will not scale".

Paper responses must be submitted to the course Piazza 18 hours before the class on which they are do. Thus, for Monday readings, your paper response should be submitted before 2:30pm on Sunday. For Wednesday readings, they should be submitted before 2:30pm on Tuesday.

Your paper responses will be submitted as follow-up discussions to each paper's post on the course Piazza. You should not read other students' response before writing your own. After posting your response, you may engage with other students' reponses, e.g. answering a clarification question someone else had.

Assignment

In the assignment, you will design and evaluate a coverage-guided fuzzer for Python. Refer to the assignment webpage for detailed instructions.

Project

The course project constitutes the majority of the course grade. The project is open-ended; the purpose of the project is to give you the opportunity to conduct research in the field of program analysis, debugging, or automated testing. You may pursue the project alone, though I recommend you do it in groups of 2-4; this will give you the opportunity to conduct a project with broader impact.

The scope of acceptable project topics is broad. You are encouraged to find a project with intersects with your own research interests or topic of specialization. But, your project should touch topics in debugging, program analysis, and testing. You may consider a project where you: develop a new tool for testing/analyzing programs in a particular domain, tweak an existing algorithm, conduct an extensive re-evaluation of an existing tool, reimplement an algorithm in a new domain, or create a benchmark suite. If you are not sure if a project is in scope, feel free to set up a meeting with me prior to submitting the project proposal.

Project Ideas Here are some sample project ideas, of varying complexity (some may be better pursued in groups). You need not restrict yourself to these ideas.

  • Implement a choice-based backend to JQF in order to make RLCheck Guidance runnable on any arbitrary generator.
  • Propose and evaluate a new state/reward framework for RLCheck.
  • Run PerfFuzz on a broader benchmark set (i.e. programs not in its evaluation) and evaluate whether the inputs it finds truly represent performance problems.
  • Collect a benchmark set of software patches. Evaluate the performance of AFLGo, Hawkeye, or other directed testing tools (e.g. KATCH, SHADOW) on your new benchmark set.
  • Develop an algorithm to dynamically infer type signatures from Python calls in order to improve Automated Unit Test Generation that relies on type signatures.
  • Study and discuss the links between novelty search and coverage-guided fuzzing. Construct toy benchmarks on which you can evaluate whether conclusions about novelty search also apply to coverage-guided fuzzing.

Deliverables

Proposal. Your project proposal should roughly follow this structure, adapting to the nature of your project (new tool vs. replication vs. benchmark creation):

  • Background. What is the problem domain you are investigating? Why is the problem important?
  • Intended approach. What, precisely, are you aiming to build or investigate in the project? What precise problem will you be solving?
  • Evaluation plan. If you are planning on running an experiment, how many benchmarks will you evaluate on? How long will you run your tool(s)?
  • Timeline. Determine a timeline you will follow to achieve your implementation and experimental goals, as well as to deliver the writeup + presentation.

Check-in. The check-in is nearly a month in to the project. It should outline the progress achieved so far. If the timeline's goals have not been reached so far, provide an updated timeline. If, in starting the project, a different research direction has emerged as the more interesting one to follow, the check-in provides a mechanism to update the proposal.

Presentation. The final presentation should concisely convey:

  • Background + Problem Context Why is this project important?
  • Key details of accomplished approach. In what you ended up building, what are the key contributions?
  • Evaluation Results. How did you evaluate your approach? What were the results?

Writeup. Your final writeup should cover the following topics. It should be in a format common to and PL/SE conference.

  • Background. What is the problem domain you are investigating? Why is the problem important?
  • Approach. What, precisely, did you build or investigate in the project?
  • Evaluation. What research questions will your evaluation answer? How does your experiment design help answer these questions? What are the results of the experiment, and how do those results inform the research questions?
  • Discussion. Discuss: any observations on the experiment which do not fit in the main evaluation, any goals that were not achieved, potential room for improvement
  • Related work. How does your approach compare and contrast with related approaches?

Instructions for Discussion Leads

You will serve as discussion lead for one paper in the class. As a discussion lead, you do not need to submit a response to the paper being read on Piazza. However, you should read the paper in detail, as you will be the "goto" person for clarification questions during discussion. Read through other students' responses, and identify common themes and interesting questions.

Prepare the following materials.
  • A 5-10 minute summary of the key technical contributions of the paper. You may make a slide deck or use the whiteboard. Caroline will by default project the paper for reference during discussion.
  • Identify 3 common themes, future work suggestions, or points of contention from your reading and the other students' responses. Be prepared to briefly explain these to jump-start discussion.
  • Identify any clarification questions from your own reading of the paper and other students' reponses.

This course is loosely based off of Koushik Sen's 2016 offering of CS 294-15.