A Spark in the Cloud: Iterative and Interactive Cluster Computing

If you have a question about this talk, please contact Eiko Yoneki.

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for a wide array of popular use cases, including many iterative machine learning algorithms. We present Spark, a framework optimized for iterative jobs, in which a dataset is reused across multiple parallel operations, that retains the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs), based on the concept of data lineage. Spark provides a functional programming model similar to MapReduce, but also lets users hint that data should be cached between iterations, leading to up to 10x better performance than Hadoop on some jobs. Spark also makes programming jobs easy by integrating cleanly into the Scala programming language (a high-level language on the JVM). Finally, the ability of Spark to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big datasets. We have modified the Scala interpreter to make it possible to use Spark interactively in this manner, providing a significantly more responsive experience than Hive and Pig.
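
To make the lineage and caching ideas in the abstract concrete, here is a minimal single-process sketch in Python. It is not Spark's API or implementation: the `ToyRDD` class, its `evict` method and its `computations` counter are hypothetical names invented for this illustration. The sketch assumes only what the abstract states, namely that a dataset remembers how it was derived, that the user can hint for it to be kept in memory, and that a lost in-memory copy can be rebuilt from its lineage.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for a lineage-tracked dataset (not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterable],
                 parent: Optional["ToyRDD"] = None):
        self._compute = compute          # recipe that rebuilds this dataset
        self.parent = parent             # lineage pointer to the parent dataset
        self._cache: Optional[List] = None
        self._cache_requested = False
        self.computations = 0            # how many times the data was rebuilt

    def map(self, fn: Callable) -> "ToyRDD":
        # The child stores a recipe referring to its parent, not the data itself.
        return ToyRDD(lambda: (fn(x) for x in self.collect()), parent=self)

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self.collect() if pred(x)), parent=self)

    def cache(self) -> "ToyRDD":
        # A hint only: the data is materialized on the first collect().
        self._cache_requested = True
        return self

    def evict(self) -> None:
        # Simulates losing the in-memory copy, for example on a node failure.
        self._cache = None

    def collect(self) -> List:
        if self._cache is not None:
            return self._cache
        self.computations += 1
        data = list(self._compute())     # replay the lineage
        if self._cache_requested:
            self._cache = data
        return data


# An "iterative job": the same filtered dataset is reused in every iteration.
source = ToyRDD(lambda: range(10))
evens = source.filter(lambda x: x % 2 == 0).cache()
totals = [sum(evens.map(lambda x, k=k: x * k).collect()) for k in range(1, 4)]
builds_while_cached = evens.computations

# Drop the cached copy and read again: the data is rebuilt from its lineage.
evens.evict()
recovered = evens.collect()
```

In this toy, the three iterations build `evens` only once because of the caching hint, and the read after `evict()` replays the filter over the source instead of failing. A real system would partition the data across machines and recompute only the lost partitions; none of that is modelled here.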

Bio: Mosharaf Chowdhury is a Ph.D. student working with Prof. Ion Stoica in the RAD Lab at UC Berkeley. He received his B.Sc. in Computer Science and Engineering from Bangladesh University of Engineering and Technology and his M.Math in Computer Science from the University of Waterloo. His research interests are in large-scale data-parallel systems, data center networks, and network virtualization.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.

