Mechanistic Interpretability - Progress and Limits
- Speaker: Arthur Conmy (Google DeepMind)
- Date & Time: Tuesday 03 March 2026, 16:00 - 17:00
- Venue: Lecture Theatre 2, Computer Laboratory, William Gates Building
Abstract
In the broadest sense, mechanistic interpretability refers to explaining a neural network's behavior in terms of its internal components. We cover early work on vision models, transformer circuits, and automated circuit discovery. We then turn to superposition (what it means mathematically and why we think it occurs in modern transformer language models), the linear representation hypothesis, and sparse autoencoders. Finally, we discuss recent applications in deployed AI systems, and offer a balanced perspective on when mechanistic interpretability is the right tool and when other approaches may be more appropriate as future AI systems become more capable.
Bio: Arthur Conmy is a Senior Research Engineer at Google DeepMind. He produced foundational mechanistic interpretability research, including Interpretability in the Wild (ICLR) and ACDC: Automated Circuit Discovery (NeurIPS 2023), and recently added activation probes to live Gemini deployments to detect misuse.
Series
This talk is part of the Artificial Intelligence Research Group Talks (Computer Laboratory) series.