Making the Most of Massive Clusters
- 👤 Speaker: Fiodar Kazhamiaka, Stanford
- 📅 Date & Time: Thursday 02 December 2021, 15:00 - 16:00
- 📍 Venue: FW11 and https://cl-cam-ac-uk.zoom.us/j/97216272378?pwd=M2diTFhMTnppckJtNWhFVTBKK0REZz09
Abstract
Resource management systems play an important role in today’s large clusters, allocating jobs and containers to compute resources while balancing metrics like fairness, efficiency, and fault tolerance. Existing management policies in systems such as Kubernetes, VMware’s DRS, and Red Hat’s OpenShift rely on heuristic-based schedulers, which often scale well but are typically sub-optimal. This problem is made worse by the growing trend of heterogeneous clusters (composed of a mix of several generations of CPUs, GPUs, etc.), where existing heuristics perform poorly.
This talk will emphasize the environmental footprint of large resource clusters as a key motivation. I’ll first describe our work on allocating ML training jobs in heterogeneous clusters. A key insight is that many popular scheduling objectives can be cast as mathematical optimization problems whose solutions can maximize cluster efficiency; other systems take a similar approach, for example TetriSched and Facebook’s RAS. However, optimization-based techniques are notorious for scaling poorly to massive systems. To address this issue, I will describe POP: a technique to partition the problem and quickly approximate the optimal allocation. POP reduces solve times by several orders of magnitude with minimal performance loss across a wide range of problem domains, including cluster scheduling and network traffic engineering.
Bio: Fiodar is currently a postdoctoral fellow at the Stanford Future Data Systems lab, working with Matei Zaharia and Peter Bailis. His research interests span ML systems, energy systems, and data science, with a focus on finding practical solutions to fundamental problems. He obtained his PhD from the University of Waterloo, where his thesis on the optimization of solar panel and battery systems was recognized with the Cheriton Distinguished Dissertation Award.
Series
This talk is part of the Computer Laboratory Systems Research Group Seminar series.