CPSC 528L: Software Engineering Tools in the Era of LLMs

CPSC 528L, Winter Term 2 2026 (Jan-Apr 2027), Instructor: Caroline Lemieux (clemieux@cs.ubc.ca)



Course Description

Software engineering tools exist to help developers write correct, performant, and secure software. When we think of concrete tools used in software engineering, perhaps the most familiar is the IDE. IDEs encapsulate a variety of different software engineering tools: code completion tools, debuggers, static analyses that detect errors before the code is run, and many more.

Software engineering tooling research aims to push the bounds on which software development tasks can be automated, and to what extent. And there are many different tasks to tackle...

  • Code completion tools aim to predict the next character, word, or line a developer intends to write.
  • Automated test generation aims to, well, automatically test programs: be it to find inputs that expose bugs in a given program, or to generate whole test suites for a given program.
  • Vulnerability detection tools work either by generating inputs that demonstrate security flaws, or by labelling sections of code as particularly susceptible to exploitation.
  • Profilers and debuggers analyze code as it runs, giving developers richer information about code execution that helps them identify and fix bugs.
  • Program repair tools aim to, given some issue report or failing tests, produce a fixed program.
  • Code-documentation inconsistency detection tools aim to point out to developers where their code and documentation differ, and sometimes suggest how to fix the discrepancy.
  • Specification mining aims to analyze code, or code executions, to extract higher-level properties (specifications) of the program being analyzed; developers can then inspect these to see whether they match their expectations.
  • Program synthesis goes the other way: given specifications (in a formal language, in the form of test cases, or in natural language), it aims to generate code that matches these specifications.

This list is not exhaustive. The point is: software engineering tooling research is a broad field. Innovations have come from researchers in many different fields, including Software Engineering, Programming Languages, Security, Systems, and Machine Learning.
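
To make a few of these tasks concrete, here is a toy example (hypothetical, written for this page, not drawn from any benchmark). An automated test generator might discover the failing input below, a program repair tool might produce the fixed version, and a code-documentation inconsistency detector might flag that the docstring promises more than the code delivers:

    def median(xs):
        """Return the median of a non-empty list of numbers."""
        ys = sorted(xs)
        return ys[len(ys) // 2]  # BUG: wrong for even-length lists

    # An automated test generator might find this bug-exposing input:
    def test_median_even_length():
        assert median([1, 2, 3, 4]) == 2.5  # fails: median returns 3

    # A program repair tool, given the failing test, would aim to produce:
    def median_fixed(xs):
        """Return the median of a non-empty list of numbers."""
        ys = sorted(xs)
        n, mid = len(ys), len(ys) // 2
        return ys[mid] if n % 2 == 1 else (ys[mid - 1] + ys[mid]) / 2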

With the advent of Large Language Models, which have improved code completion by leaps and bounds, we are seeing more software engineering tool work pop up in machine learning venues. In addition, the flexibility of LLMs at generating code gives Software Engineering researchers an incredibly powerful building block to use in software engineering tools; a huge percentage of work in Software Engineering is now pushing the bounds of where and how LLMs can improve the performance of software engineering tooling.

With this excitement comes a flood of research. With this flood of research comes many questions: What are the limits of software engineering tools? How do we evaluate the significance of work in which prompt engineering is a major part of the research? What is the relevance of pre-LLM work, if it performs worse on end metrics? Where does the potential for impact lie?

This course aims to answer those questions, and more. The course will introduce students to Software Engineering Tooling research. We will cover the theoretical limits that constrain the performance of software engineering tools. We will cover research norms in software engineering: how significance and impact are evaluated by the field. We will read papers that established research in, greatly improved the performance of, or enlivened various software engineering tools. Each paper-reading week, we will pair one such classic paper with a recent LLM-driven work in its domain.

This course is suitable for students unfamiliar with software engineering research. It is a complement to CPSC 507, which focuses on research methods in software engineering, especially human factors.

Course-level learning goals (Subject to Change)

After the course, participants should be able to:
  • Explain the theoretical limits (e.g., Rice's theorem, the NP-completeness of SAT) that constrain the performance of software engineering tools (a sketch of the former appears after this list)
  • Describe the problems solved by the following techniques, recognize the scenarios in which each technique can be used, and analyze the pros and cons of each technique, for the following techniques:
    • (See Weekly Topics Below)
  • Identify contributions and room for improvement in existing research papers
  • Debate the positives and negatives of existing research papers
  • Categorize and summarize concerns and questions about existing research papers
  • Design and conduct the experimental evaluation of a program testing or analysis tool
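
As a preview of the theoretical-limits lecture, here is a minimal sketch, in Python, of the halting-problem reduction behind Rice's theorem. Everything here is hypothetical, written for this page: returns_zero stands in for an imagined perfect analyzer, and the sketch shows why no such analyzer can exist.

    def returns_zero(f):
        """Imagined perfect analyzer: does f() ever return 0?"""
        raise NotImplementedError  # Rice's theorem: no exact, always-terminating analyzer exists

    def would_halt(program, inp):
        """If returns_zero existed, this would decide the halting problem."""
        def gadget():
            program(inp)  # loops forever exactly when program does not halt on inp
            return 0      # reached if and only if program halts on inp
        return returns_zero(gadget)  # True iff program halts on inp: a contradiction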

Format

This is a seminar-style class, primarily driven by in-class discussion of a research paper. Each student is expected to submit a response to each paper by the day before the class in which it is discussed. Each student will also serve as discussion lead for 1-2 papers; the discussion lead will read through other students' responses and prepare questions to start the discussion. Around half your course mark comes from participating in discussions and submitting paper responses.

The other half of your course mark will come from an open-ended course project. The project is meant to be a small investigation into LLMs for Software Engineering tooling. It is not expected to be a full (publishable) research project. Instead, the goal is to gain a deeper understanding of the performance of SE tools on a task that interests you. A few directions of inquiry include:

  • Comparing SE tools (either LLM or non-LLM) to the performance of frontier models without specific prompting (see the sketch after the benchmark list below).
  • Qualitatively analyzing the results of an LLM-based SE tool which was only quantitatively analyzed.
  • Running an SE tool in a setting beyond the one it was originally evaluated on (e.g., larger/different code bases, new data).
  • Analyzing performance differences of an LLM-based SE tool when run with different LLMs.
  • Comparing an LLM-based SE tool to a non-LLM tool it was not previously compared with.
  • ... you are welcome to discuss with Caroline any ideas you have that are not listed here.

I will populate the list below with interesting sources of benchmarks as the semester gets closer.

  • RubberDuckBench: A small set of code understanding questions, with thorough rubrics.
  • Defects4J: A classic dataset of bugs and bug fixes in Java (likely in LLMs' training set)
  • BugsInPy: A Defects4J-inspired dataset of bugs and bug fixes in Python (likely in LLMs' training set)
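
As a minimal sketch of the first direction above: ask a frontier model to fix a bug with no task-specific prompting, then check its patch against the project's tests. Assumptions: the OpenAI Python client is just one possible model interface, "gpt-4o" is a placeholder model name, and load_bug and run_tests are hypothetical helpers you would write around a benchmark such as BugsInPy.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

    def attempt_repair(buggy_source: str, model: str = "gpt-4o") -> str:
        """Ask a frontier model to fix a bug, with no task-specific prompting."""
        prompt = f"Fix the bug in this Python function:\n\n{buggy_source}"
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # Hypothetical driver, to compare against an APR tool's published results:
    # bug = load_bug("pandas", bug_id=42)   # hypothetical benchmark loader
    # patch = attempt_repair(bug.buggy_source)
    # print(run_tests(bug.project, patch))  # same pass/fail metric as the tool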

The schedule will also be updated with specific papers you can use as inspiration.

You are strongly encouraged to conduct the project in a team of 2-4 students. This usually results in more interesting projects, and less work for each individual student.

The project deliverables include an initial proposal, a project check-in, an in-class presentation of the work, and a final report summarizing what has been achieved.

Grading (Subject to Change)

  • Paper Responses: 10%
  • In-class Participation: 20%
  • Discussion Lead: 15%
  • Project Proposal: 8%
  • Project Check-in: 2%
  • Project Presentation: 20%
  • Project Writeup: 25%

Schedule (Subject to Change)

Week Topic
Jan 4 Introduction, Theoretical Limits of Software Engineering Tooling (Lecture)
Jan 11 A Case Study in Scale vs. Provability: Fuzzing vs Symbolic Execution (Lecture)
Jan 18 Code Completion (Papers)
Jan 25 Automated Program Repair (Papers)
Feb 1 Basics of Program Analysis (Lecture)
Feb 8 Program Synthesis (i.e., Code Generation) (Papers)
Feb 15 No Class. Reading Week.
Feb 22 Student Project Pitches
Mar 1 Unit Test Generation (Papers)
Mar 8 Compiler Testing (Papers)
Mar 15 Specification Mining (Papers)
Mar 22 Code-Documentation Inconsistency Detection (Papers)
Mar 29 Vulnerability Detection (Papers)
Apr 5 Project Presentations
Apr 12 No Class

Academic Honesty

You will receive zero on any written submission (paper response, proposal, final write-up) that is found to contain plagiarized material. This includes even a single plagiarized sentence, and covers plagiarism from other students as well as from other research papers. Refer to this page for more details on the definition of plagiarism, and tips on how to avoid it.

There is zero tolerance for the use of GenAI to generate written materials for the course, including paper responses, project proposals, etc. That is, all writing submitted for the course should be produced by a human. I will always favour short, clear, human-written text over long stretches of GenAI-written text. You are free to use GenAI for coding tasks in your course project; however, you will need to be able to assess whether the coding task was implemented correctly. Using GenAI to spellcheck your work is fine.

Why no GenAI for the paper responses? Because the goal of writing paper responses is not to provide us with yet another summary of a paper. It is instead to provide (a) the response-writer with the opportunity to critically examine what parts of the paper they have understood; and (b) the response-readers (discussion leads and instructor) with an understanding of which parts of the paper need extra elucidation or joint discussion. If the paper responses are generated by passing the paper to GenAI, neither of these goals will be accomplished.