Data Selection for Pre-training and Instruction-tuning of LLMs
- 👤 Speaker: Danqi Chen, Princeton University
- 📅 Date & Time: Thursday 16 May 2024, 14:00 - 15:00
- 📍 Venue: https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
Abstract
There is increasing evidence that choosing the right training data is essential for producing state-of-the-art large language models (LLMs). How can we decide on high-quality training data? Can we possibly select fewer data examples to improve performance and efficiency? In this talk, I will present two recent works on selecting high-quality data in pre-training and instruction tuning. I will first present QuRating, a simple framework for selecting pre-training data that captures the abstract attributes of texts humans intuitively perceive. We demonstrate that using state-of-the-art LLMs (e.g., GPT -3.5-turbo) can discern these qualities in pairwise judgments and emphasize the importance of balancing quality and diversity. We have created QuRatedPajama, a dataset comprising 260 billion tokens with fine-grained quality ratings, and show that sampling according to these ratings improves perplexity and in-context learning. Second, I present LESS , a method that effectively estimates data influences for identifying relevant instruction-tuning data for specific applications (a setting we call “targeted instruction tuning”). LESS is efficient, transferrable (we can use a smaller model for data selection), optimizer-aware (working with Adam), and easy to interpret. We show that training on a LESS -selected 5% of the data can often outperform training on full datasets on diverse downstream tasks.
Series This talk is part of the Language Technology Lab Seminars series.
Included in Lists
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Guy Emerson's list
- https://cam-ac-uk.zoom.us/j/97599459216?pwd=QTRsOWZCOXRTREVnbTJBdXVpOXFvdz09
- Interested Talks
- Language Sciences for Graduate Students
- Language Technology Lab Seminars
- ndk22's list
- ob366-ai4er
- rp587
- Simon Baker's List
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.
![[Talks.cam]](/static/images/talkslogosmall.gif)

Danqi Chen, Princeton University
Thursday 16 May 2024, 14:00-15:00