Counterargument to CIRL, and Safely Interruptible Agents
- 👤 Speaker: Adrià Garriga Alonso (University of Cambridge)
- 📅 Date & Time: Wednesday 06 December 2017, 17:00 - 18:30
- 📍 Venue: Cambridge University Engineering Department, CBL Seminar room BE4-38. For directions see http://learning.eng.cam.ac.uk/Public/Directions
Abstract
Cooperative Inverse Reinforcement Learning (CIRL) is a game between a robot R and a human H, in which R tries to maximise H's reward without knowing it. R is incentivised to shut down at H's suggestion, since the suggestion provides information about H's reward function. However, Carey (2017) shows that, if R and H do not share the same prior over the reward, R may remain incorrigible. Carey then makes a case for forced interruptibility. We will talk about Carey's examples and the strength of the case for forced interruptibility.
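The prior-mismatch point can be illustrated with a toy off-switch game in the spirit of the CIRL literature. The following sketch is my own illustration, not taken from Carey's paper: the utilities, probabilities, and action names are assumptions chosen to make the argument concrete.

```python
# Toy off-switch game (my own numbers, not from Carey 2017): R can act
# immediately, defer to H (who interrupts R exactly when acting is bad),
# or shut down. R evaluates actions under its prior P(u = +1).
def value(action, prior_plus):
    """R's expected utility for H under the prior P(u = +1) = prior_plus."""
    u_plus, u_minus = 1.0, -1.0   # assumed utilities of acting, to H
    if action == "act":           # act now, ignoring H
        return prior_plus * u_plus + (1 - prior_plus) * u_minus
    if action == "defer":         # wait; H shuts R down when u = -1
        return prior_plus * u_plus + (1 - prior_plus) * 0.0
    return 0.0                    # "off": shut down immediately

# With a calibrated prior, deferring to H weakly dominates acting,
# so R is happy to remain interruptible.
assert value("defer", 0.7) >= value("act", 0.7)

# With a misspecified prior that puts no mass on u = -1, deferring has
# no advantage: H's shutdown command carries no information for R, and
# R has no incentive to obey it. This is the incorrigibility failure.
assert value("act", 1.0) == value("defer", 1.0)
```

The sketch only captures the one-shot intuition; Carey's examples work through the sequential CIRL setting.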
Orseau and Armstrong (2016) provide a formal notion of satisfactory learning under forced interruptions. They then show that Q-learning satisfies it, and that SARSA and AIXI with exploration can be modified to satisfy it. We will go over the proof outlines and discuss their implications for corrigibility.
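The core of the Q-learning result is that its update is off-policy: the target uses the best action in the next state, not the action actually executed, so forcing interruptions into behaviour does not shift the learned values. A minimal sketch of the contrast with SARSA, using my own tabular setup (two states, two actions, assumed constants), not code from the paper:

```python
# Sketch: why off-policy Q-learning is safely interruptible while plain
# SARSA is not (after Orseau & Armstrong 2016). States/actions/constants
# here are assumptions for illustration.

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Off-policy target: max over actions in s_next. Even if an
    # interruption forces a different action there, the target, and
    # hence the fixed point of learning, is unchanged.
    target = r + gamma * max(Q[(s_next, b)] for b in (0, 1))
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy target: uses the action actually taken in s_next, so a
    # forced interruption leaks into the learned values. This is why
    # SARSA needs modification to be safely interruptible.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One step of each on a zero-initialised table over states {0,1},
# actions {0,1}, with reward 1 for (s=0, a=0).
Q = {(s, b): 0.0 for s in (0, 1) for b in (0, 1)}
q_update(Q, 0, 0, 1.0, 1)
assert abs(Q[(0, 0)] - 0.1) < 1e-9
```

The actual proofs concern asymptotic convergence under an interruption scheme whose probability tends to one; the sketch only shows where the two update rules differ.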
Reading list:
Ryan Carey. 2017. “Incorrigibility in the CIRL Framework.” arXiv:1709.06275 [cs.AI].
Laurent Orseau and Stuart Armstrong. 2016. “Safely Interruptible Agents.” Paper presented at the 32nd Conference on Uncertainty in Artificial Intelligence.
Slides: https://valuealignment.ml/talks/2017-12-06-interruptibility.pdf
Series: This talk is part of the Engineering Safe AI series.