Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis

Topics in Programming Languages: Automated Testing, Bug Detection, and Program Analysis

CPSC 539L, Winter Term 1 2024, Instructor: Caroline Lemieux (clemieux@cs.ubc.ca)

Tu/Th 12:30PM-2:00PM, ORCH 4058

Sign up for course Piazza

Office hours by appointment, email Caroline (clemieux@cs.ubc.ca)


Quick Links

FormatScheduleAcademic HonestyResponsesProjectDiscussion Leads

Course description ^

Software bugs remain pervasive in modern software systems. As software becomes increasingly intertwined in all aspects of our lives, the consequences of these bugs become increasingly severe. For instance, since the mid 2010s, several important security vulnerabilities have emerged from basic correctness bugs in software (e.g. Heartbleed, Cloudbleed, Shellshock ). Though not all bugs cause important security vulnerabilities, their overall cost remains high: one estimate puts the cost of dealing with software failures and defects in the United States alone in 2018 at nearly USD$1,540,000,000---the worldwide impact of these defects is even more severe.

While human-written tests are a prevailing method for assuring the quality of software in practice, they are still subject to human error, and allowing bugs to slip through the cracks that the human tester did not consider. This class will explore topics in the field of automated testing, which aims, at a high-level, to reduce the developer burden in discovering bugs and writing test cases. Recent advances in this field leverage static and dynamic information from the program under test in order to effectively explore the space of behaviors of the program under test, enabling the quick discovery of potentially severe security bugs.

The concerns of scalable bug detection and program debugging require innovation from multiple different fronts. As such, this course will cover papers from Software Engineering, Programming Languages, Security, and Systems venues.

Course-level learning goals

After the course, participants should be able to:
  • Describe the problems solved by the following techniques, recognize the scenarios in which each technique can be used, and analyze the pros and cons of each technique, for the following techniques:
    • Test-input generation: blackbox fuzzing, greybox fuzzing, symbolic execution
    • Test oracles: crashing oracles, differential oracles, automated test-oracle generation
    • Property-based testing
    • Whole test-suite generation
    • Pattern-based Static Analysis
    • Fault Localization
    • Dynamic Data Race Detection
  • Identify contributions and room for improvement in existing research papers;
  • Debate the positives and negatives of existing research papers;
  • Categorize and summarize concerns and questions about existing research papers
  • Design and conduct the experimental evaluation of a program testing or analysis tool.

Format ^

This is a seminar-style class, which is primarily driven by in-class discussion of a research paper. Each student is expected to submit a response to the paper being discussed in the class 18 hours ahead of the class in which it is discussed. Each student will also serve as discussion lead for 1 paper; the discussion lead will read through other students' responses and prepare questions to start the discussion.

The bulk of your course mark will come from an open-ended course project. The project is strongly encouraged be conducted as a team of 2-4 students. Students may ask for exceptions to conduct individual projects The project deliverables include an initial proposal, a project check-in, an in-class presentation of the work, and a final report summarizing what has been achieved.

Note this structure is quite distinct from most undergraduate courses which are heavily structured and assessment-driven. The open-ended nature of the responses and the project can be somewhat unsettling if coming straight from heavily structured courses. The goal of this course is to expose students to the field of automated testing, and foster their research abilities, as outlined in the learning goals above.

Grading (Subject to Change)

Attendance at the first day of class (Sep 5) is mandatory and will be counted towards your participation grade. If you are on the waitlist, please attend the first day of class anyway.
  • Paper Responses: 22.5%
  • In-class Participation: 15%
  • Discussion Lead: 7.5%
  • Project Proposal: 12.5%
  • Project Check-in: 2.5%
  • Project Presentation: 10%
  • Project Writeup: 30%

Schedule ^

Note: this schedule is still subject to change.

Tue Sep 3 Imagine UBC Day. No class.
Thu Sep 5 Course Overview, Introductions

Slides from class.

Suggested Reading: Sections 1, 1.1 (not 1.1.1, 1.1.2), 1.2 (not 1.2.1), Caroline's Dissertation.


Optional Reading (do not write a response): How to Read a Paper, Srinivasan Keshav.


Sign up for course piazza: piazza.com/ubc.ca/winterterm12024/cpsc539l.
DL: Caroline
Tue Sep 10 Random (a.k.a. Blackbox) Fuzzing

Barton P. Miller, Mengxiao Zhang, and Elisa R. Heymann. 2022. The Relevance of Classic Fuzz Testing: Have We Solved This One? IEEE Trans. Softw. Eng. 48, 6 (June 2022), 2028–2039.
DL:
Thu Sep 12 Coverage-Guided (Greybox) Fuzzing

Andrea Fioraldi, Alessandro Mantovani, Dominik Maier, and Davide Balzarotti. 2023. Dissecting American Fuzzy Lop: A FuzzBench Evaluation. ACM Trans. Softw. Eng. Methodol. 32, 2, Article 52 (March 2023), 26 pages.

Suggested Reading: Section 2.1, Caroline's Dissertation.
DL:
Tue Sep 17 Property-Based Testing (aka Generator-Based Fuzzing).

Koen Claessen and John Hughes. 2000. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proceedings of the fifth ACM SIGPLAN international conference on Functional programming (ICFP '00). Association for Computing Machinery, New York, NY, USA, 268–279.
DL: Jasper
Thu Sep 19 Combining Coverage-Guided and Generator-Based Fuzzing.

Rohan Padhye, Caroline Lemieux, Koushik Sen, Mike Papadakis, and Yves Le Traon. 2019. Semantic fuzzing with zest. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2019). Association for Computing Machinery, New York, NY, USA, 329–340.
DL:
Tue Sep 24 Symbolic Execution

Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: unassisted and automatic generation of high-coverage tests for complex systems programs. In Proceedings of the 8th USENIX conference on Operating systems design and implementation (OSDI'08). USENIX Association, USA, 209–224.
DL: Sepehr
Thu Sep 26 Dynamic Symbolic Execution (Concolic Execution)

Koushik Sen, George Necula, Liang Gong, and Wontae Choi. 2015. MultiSE: multi-path symbolic execution using value summaries. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2015). Association for Computing Machinery, New York, NY, USA, 842–853.
DL:
Tue Oct 1 Neural Networks + Fuzzing

D. She, K. Pei, D. Epstein, J. Yang, B. Ray and S. Jana. 2019. NEUZZ: Efficient Fuzzing with Neural Program Smoothing. 2019 IEEE Symposium on Security and Privacy (SP). San Francisco, CA, USA, pp. 803-817.
DL:
Thu Oct 3 Neural Networks + Fuzzing Continued

Maria-Irina Nicolae, Max Eisele, and Andreas Zeller. 2023. Revisiting Neural Program Smoothing for Fuzzing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023). Association for Computing Machinery, New York, NY, USA, 133–145.

Project Proposal Due Oct 6th, 5pm, by email.
DL: Zahin
Tue Oct 8 Beating Magic Bytes.

Cornelius Aschermann, Sergej Schumilo, Tim Blazytko, Robert Gawlik, Thorsten Holz. 2019. REDQUEEN: Fuzzing with Input-to-State Correspondence. In Proceedings of Network and Distributed System SecuritySymposium, NDSS'19.
DL: Fatemeh
Thu Oct 10 Evaluating Randomized Algorithms.

Andrea Arcuri and Lionel Briand. 2011. A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (ICSE '11). Association for Computing Machinery, New York, NY, USA, 1–10.
DL:
Tue Oct 15 Whole Test Suite Generation

Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. 2007. Feedback-Directed Random Test Generation. In Proceedings of the 29th international conference on Software Engineering (ICSE '07). IEEE Computer Society, USA, 75–84.

DL: Pheobe
Thu Oct 17 Test Suite Generation + LLMs

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen. 2023. CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-Trained Large Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE '23). IEEE Press, 919–931.
DL:
Tue Oct 22 Test Oracle Generation

Elizabeth Dinella, Gabriel Ryan, Todd Mytkowicz, and Shuvendu K. Lahiri. 2022. TOGA: a neural method for test oracle generation. In Proceedings of the 44th International Conference on Software Engineering (ICSE '22). Association for Computing Machinery, New York, NY, USA, 2130–2141.
DL:
Thu Oct 24 Project Work Day
Tue Oct 29 Pattern-Based Static Analysis

David Hovemeyer and William Pugh. 2004. Finding bugs is easy. SIGPLAN Not. 39, 12 (December 2004), 92–106.
DL: Gauransh
Thu Oct 31 Type-Annotation Static Analysis

Martin Kellogg, Vlastimil Dort, Suzanne Millstein, and Michael D. Ernst. 2018. Lightweight verification of array indexing. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2018). Association for Computing Machinery, New York, NY, USA, 3–14.
DL: Masih
Tue Nov 5 Delta-Debugging

Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and Isolating Failure-Inducing Input. IEEE Trans. Softw. Eng. 28, 2 (February 2002), 183–200.
DL: Sean
Thu Nov 7 Fault Localization.

James A. Jones, Mary Jean Harrold, and John Stasko. 2002. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE '02). Association for Computing Machinery, New York, NY, USA, 467–477.

Project Check-in Due Nov 8th, 5pm, by email.
DL: Gabriela
Tue Nov 12 Midterm Break. No Class.
Thu Nov 14 Project Work Day
Tue Nov 19 Compiler Testing

Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11). Association for Computing Machinery, New York, NY, USA, 283–294.
DL: Kjell
Thu Nov 21 Compiler Testing in the Era of LLMs

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (ICSE '24). Association for Computing Machinery, New York, NY, USA, Article 126, 1–13.
DL:
Tue Nov 28 Bug Detection in Concurrent Programs

Robert O'Callahan and Jong-Deok Choi. 2003. Hybrid dynamic data race detection. In Proceedings of the ninth ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '03). Association for Computing Machinery, New York, NY, USA, 167–178.
DL: Mayant
Thu Nov 30 Scalable Bug Detection in Concurrent Programs

Guangpu Li, Shan Lu, Madanlal Musuvathi, Suman Nath, and Rohan Padhye. 2019. Efficient scalable thread-safety-violation detection: finding thousands of concurrency bugs during testing. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). Association for Computing Machinery, New York, NY, USA, 162–180.
DL:
Tue Dec 3 Project presentations
Thu Dec 5 Project presentations

Project Report Due Dec 6th, 5pm, by email.

Academic Honesty ^

You will receive zero on any written submission (paper response, proposal, final write-up) that is found to have plagiarised materials. This includes even a single plagiarized sentence. This includes plagiarism from other students, as well as from other research papers. Refer to this page for more details on the definition of plagiarism, and tips on how to avoid it.

Paper Responses ^

Read each paper to the best of your ability before drafting your paper response. There are two goals to the paper responses:

  • Convey your understanding of the paper's core innovations
  • Convey your opinion of the paper, in order to spark in-class discussion
Structure your paper responses as:
  • A short paragraph summarizing the paper. What is the problem being tackled? How was it adressed by prior work? What are the innovation(s) proposed in this paper? How are those innovations evaluated?
  • Your opinion of the paper, which can be structured in a paragraph or bullet points. Some questions to guide you as you read and write your response:
    • What remains unclear after reading the paper? Are there any clarification questions whose answers would substantially change your opinion of the paper?
    • How does the paper's evaluation match with the proposed problem statement?
    • Do you forsee any barriers to the applicability of the technique proposed in the paper? If so, how could these barriers be overcome?
    • Which technical innovations are most compelling to you?
    • Which problems remain unsolved after this paper?
Remember, the responses help spark in-class discussion. Thus, specificity and depth in your responses is useful. A detailed description of a particular issue raised while reading the paper will be more useful than several high-level comments. For example, if you have doubts about the scalability of a particular technique, detail specific examples that raise those doubts, rather than simply saying "I think the technique will not scale".

Paper responses must be submitted to the course Piazza 22 hours before the class on which they are due. Thus, for Tuesday readings, your paper response should be submitted before 2:30pm on Monday. For Thursday readings, they should be submitted before 2:30pm on Wednesday. This is to give the discussion lead ample time to read the paper responses from other students.

Your paper responses will be submitted as follow-up discussions to each paper's post on the course Piazza. You should not read other students' response before writing your own. After posting your response, you may engage with other students' reponses, e.g. answering a clarification question someone else had.

Project ^

The course project constitutes the majority of the course grade. The project is open-ended; the purpose of the project is to give you the opportunity to conduct research in the field of program analysis, debugging, or automated testing. I recommend you do the project in groups of 2-4; this will give you the opportunity to conduct a project with broader impact.

The scope of acceptable project topics is broad. You are encouraged to find a project with intersects with your own research interests or topic of specialization. But, your project should touch topics in debugging, program analysis, and testing. You may consider a project where you: develop a new tool for testing/analyzing programs in a particular domain, tweak an existing algorithm, conduct an extensive re-evaluation of an existing tool, reimplement an algorithm in a new domain, or create a benchmark suite. If you are not sure if a project is in scope, feel free to set up a meeting with me prior to submitting the project proposal.

Project Ideas. Here are some sample project ideas, of varying complexity and scope. You need not restrict yourself to these ideas. Feel free to ask me for more details on these.

  • Compare the performance of CodaMOSA to a hinting strategy that pulls example code snippets from the existing project code base (rather than invoking an LLM).
  • Front-port the extended test case parsing in CodaMOSA to a more recent version of Pynguin.
  • Run PerfFuzz on a broader benchmark set (i.e. programs not in its evaluation) and evaluate whether the inputs it finds truly represent performance problems.
  • Relatedly: using PerfFuzz (to find algorithmic complexity issues) or FuzzFactory-mem (to find excess memory usages), generate inputs for a few programs with different performance behaviors. Select a power/energy measurement mechanism (see here). Evaluate whether "bad performance" runs correlate to "high energy consumption" runs.
  • Encode full abstraction problems into a fuzzable form: either properties compatible with a property-based-testing framework or with an API fuzzing framework (GraphFuzz or javascript analogue). Evaluate whether the fuzzer can find violations of the encoded full abstraction problems.
  • Build a dataset of oss-fuzz found bugs and fixes. Evaluate an existing program repair tool on this dataset.

Previous Year's Projects. Here are some projects from previous years:

  • Implementing JaCoCo as a coverage backend to JQF.
  • Creating a tool to find NaN poisoning exploits in C/C++. Built on LLVM tooling.
  • Creating an automated testing tool (based on property-based testing) for Haskell Rewrite Rules.

Deliverables

Proposal. Your project proposal should roughly follow this structure, adapting to the nature of your project (new tool vs. replication vs. benchmark creation):

  • Background. What is the problem domain you are investigating? Why is the problem important?
  • Intended approach. What, precisely, are you aiming to build or investigate in the project? What precise problem will you be solving?
  • Evaluation plan. If you are planning on running an experiment, how many benchmarks will you evaluate on? How long will you run your tool(s)?
  • Timeline. Determine a timeline you will follow to achieve your implementation and experimental goals, as well as to deliver the writeup + presentation.

Check-in. The check-in is nearly a month in to the project. It should outline the progress achieved so far. If the timeline's goals have not been reached so far, provide an updated timeline. If, in starting the project, a different research direction has emerged as the more interesting one to follow, the check-in provides a mechanism to update the proposal.

Presentation

Length. TBD, depending on number of groups.

Contents. The final presentation should convey:

  • Background + Motivation Why is this project important?
  • Key details of accomplished approach. If you built a tool/algorithm: what are the key contributions of the tool/algorithm you built? If you are conducting an empirical investigation/case study: what methodology did you adopt?
  • Evaluation Results + Conlcusion. How did you evaluate your approach? What were the results? What are the high-level takeaways of these results?

Grading. You will be graded on (1) communication effectiveness of the presentation; (2) the ability to answer questions, and (3) for groups, the fair division of the presentation amongst group members.

Writeup.

Length. Your final report should be in ACM format. Use the "sigconf" option, i.e. "\usepackage[sigconf]{acmart}" at the top of your LaTeX source. The report should not exceed 10 pages in this format; however, you may have appendices for additional figures/data. As a rough guideline, aim for at least: 5 pages total; 0.75 pages background/intro; 0.5 pages related work. The exact number of pages and length of each section will depend on the nature of your projects. For instance, projects that are improving an existing tool may spend more time talking about the details of the existing tool and the implementation details of an improvement, than on less-closely-related work.

Contents. Your final writeup should cover the following topics.

  • Background. What is the problem domain you are investigating? Why is the problem important?
  • Related work. How does your approach compare and contrast with related approaches? Focus only on the details necessary to contextualize your approach in the literature.
  • Approach. What, precisely, did you build or investigate in the project?
  • Evaluation. What research questions will your evaluation answer? How does your experiment design help answer these questions? What are the results of the experiment, and how do those results inform the research questions?
  • Discussion. Discuss any observations on the experiment which do not fit in the main evaluation, any goals that were not achieved, potential room for improvement.
  • Division of tasks. If you are working in a group, how did you divide work on the project? Include who wrote different sections of the writeup.

Grading. The final report will be graded on several points, including (not necessarily equally weighted):

  • the background provided is sufficient to understand the contribution
  • the related work is critically analyzed in order to contextualize the contribution
  • the overall project approach (as proposed and updated in the check-in) has been accomplished
  • the experimental methodology is sound, the analysis of the experiments has defined takeaways
  • the overall clarity of presentation

Instructions for Discussion Leads ^

You will serve as discussion lead for one paper in the class. As a discussion lead, you do not need to submit a response to the paper being read on Piazza. However, you should read the paper in detail, as you will be the "goto" person for clarification questions during discussion. Read through other students' responses, and identify common themes and interesting questions.

Prepare the following materials.
  • A 10-15 minute summary of the key technical contributions of the paper. You may make a slide deck or use the whiteboard. Caroline will by default project the paper for reference during discussion.
  • Identify 3 common themes, future work suggestions, or points of contention from your reading and the other students' responses. Be prepared to briefly explain these to jump-start discussion.
  • Identify any clarification questions from your own reading of the paper and other students' reponses. If you are not able to fully clarify some of the questions, feel free to point this out! Collect the parts of paper/related work you had difficulty understanding. We can work together as a group to clarify the questions.

This course is loosely based off of Koushik Sen's 2016 offering of CS 294-15.