BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Nash and Nemirovski walk into a bar: LLM alignment with Mirror Des
 cent and Proximal Methods - Michal Valko (INRIA Lille - Nord Europe Resear
 ch Centre)
DTSTART:20251111T140000Z
DTEND:20251111T144000Z
UID:TALK238540@talks.cam.ac.uk
DESCRIPTION:Traditional Reinforcement Learning from Human Feedback typical
 ly relies on reward models and preference structures such as the Bradley–
 Terry model. While effective in some cases\, these assumptions fail to ca
 pture the richness of human preferences\, which often exhibit phenomena s
 uch as intransitivity. In this talk\, we present Nash Learning from Human
  Feedback\, a more direct alternative that frames the problem as finding 
 a Nash equilibrium in a game induced by human preferences. This perspecti
 ve provides a principled way to model complex\, potentially non-transitiv
 e preferences without the need to introduce a reward model. We will surve
 y methods for approximating Nash equilibria in this setting\, with a focu
 s on fine-tuning large language models. In particular\, we show how (appr
 oximate) proximal optimization methods\, notably the Nash-MD and Mirror P
 rox algorithms\, can be adapted to achieve fast and stable convergence. F
 inally\, we discuss practical strategies for efficiently implementing the
 se approximate proximal methods in large-scale training.
LOCATION:Seminar Room 1\, Newton Institute
END:VEVENT
END:VCALENDAR
