BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Data Selection for Pre-training and Instruction-tuning of LLMs - D
 anqi Chen\, Princeton University
DTSTART:20240516T130000Z
DTEND:20240516T140000Z
UID:TALK215494@talks.cam.ac.uk
CONTACT:Panagiotis Fytas
DESCRIPTION:There is increasing evidence that choosing the right training 
 data is essential for producing state-of-the-art large language models (LL
 Ms). How can we decide on high-quality training data? Can we possibly sele
 ct fewer data examples to improve performance and efficiency? In this talk
 \, I will present two recent works on selecting high-quality data in pre-t
 raining and instruction tuning. I will first present QuRating\, a simple f
 ramework for selecting pre-training data that captures the abstract attrib
 utes of texts humans intuitively perceive. We demonstrate that state-of-t
 he-art LLMs (e.g.\, GPT-3.5-turbo) can discern these qualities in pairwis
 e judgments\, and we emphasize the importance of balancing quality and dive
 rsity. We have created QuRatedPajama\, a dataset comprising 260 billion to
 kens with fine-grained quality ratings\, and show that sampling according 
 to these ratings improves perplexity and in-context learning. Second\, I p
 resent LESS\, a method that effectively estimates data influences for iden
 tifying relevant instruction-tuning data for specific applications (a sett
 ing we call “targeted instruction tuning”). LESS is efficient\, transf
 erable (we can use a smaller model for data selection)\, optimizer-aware (
 working with Adam)\, and easy to interpret. We show that training on a LE
 SS-selected 5% of the data can often outperform training on full datasets 
 on diverse downstream tasks.\n
LOCATION:https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBd
 XVpOXFvdz09
END:VEVENT
END:VCALENDAR
