Making and breaking tokenizers
- Speaker: Sander Land (Writer)
- Date & Time: Friday 17 October 2025, 12:00 - 13:00
- Venue: SS03 Hybrid (In-Person + Online). Google Meet: https://meet.google.com/yeu-pqce-rsn
Abstract
Despite massive investments in training large language models, tokenizers remain a critical but often neglected component with weaknesses that can cause wild hallucinations, bypass safety guardrails, and break downstream applications. This talk will cover:
- Our recent research in automatically detecting problematic ‘glitch’ tokens in any model
- Fundamental issues with pretokenizers and their design
- Novel approaches to encodings and pretokenization that address some of these problems
Speaker Bio
Sander Land is a researcher at Writer, previously working at Cohere. He completed his PhD at the Department of Computer Science, University of Oxford, before undertaking a postdoc at Biomedical Engineering, King’s College London, University of London.
Series
This talk is part of the NLIP Seminar Series.