Architectures for large-scale continuous data management

If you have a question about this talk, please contact Eiko Yoneki.

The ability to do rich analytics on massive sets of unstructured data drives the operation of many organizations today. These “big data” analytics have given rise to a new class of data-intensive computing systems, like MapReduce, that can scale to very large data sets simply by employing more compute power. While these systems have been very successful, it is becoming apparent that scalability alone is not enough. Many analytics today are update-driven, and the brute-force approach of adding compute power is inefficient when analytics must be kept up to date as data change continuously.

In the first part of the talk, I will present a new approach for programming analytics that takes the continuous nature of data into consideration. A fundamental requirement for efficient processing of continuous data is the ability to incrementally update the analytics by maintaining computation state. I will argue that state should be a first-class abstraction and present Continuous Bulk Processing (CBP), a model and architecture that integrates data-parallelism for scalability with state for efficient update-driven analytics. The model lends itself to several analytics, like incremental algorithms and iterative analysis. Through real-world applications, I will show how the integration of state in the programming model affords several optimizations in the underlying system, reducing processing time and resource usage relative to current practice.
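
As a rough illustration of what keeping computation state buys, the sketch below maintains a per-key count across batches instead of recomputing over all data seen so far. It is not the CBP API, which this abstract does not describe; the names `group_by_key`, `incremental_step`, and `count_update` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def group_by_key(records):
    """Group (key, value) pairs by key, as a data-parallel system would."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def incremental_step(state, new_records, update):
    """Fold a batch of new records into existing per-key state.

    Only keys that appear in new_records are touched; state for all
    other keys is carried over unchanged. (Hypothetical helper, not CBP.)
    """
    next_state = dict(state)
    for key, values in group_by_key(new_records).items():
        next_state[key] = update(key, state.get(key), values)
    return next_state


def count_update(key, old_count, new_values):
    """Per-key update function: add the new values to the running count."""
    return (old_count or 0) + sum(new_values)


# Two batches arrive over time; each step processes only the new batch.
state = {}
state = incremental_step(state, [("a", 1), ("b", 1), ("a", 1)], count_update)
state = incremental_step(state, [("b", 1), ("c", 1)], count_update)
print(state)
```

The second step does work proportional to the size of the new batch rather than to the full history, which is the kind of saving that update-driven analytics aim for.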

While integrating state in the programming model allows efficient incremental programs, it may be challenging to design incremental algorithms for complex analytics, like iterative graph mining and machine learning. In the second part, I will talk about ongoing work on a system that can incrementally compute this class of analytics in a manner that is transparent to the user.

Bio: I am an Associate Researcher with the Telefonica Research lab in Barcelona, Spain. I am primarily interested in building systems for large-scale data mining. My broader research interests lie in the areas of data management, cloud computing and distributed systems. I received my PhD in Computer Science from the University of California, San Diego, and my Diploma in Computer Science & Engineering from the National Technical University of Athens, Greece.

This talk is part of the Computer Laboratory Systems Research Group Seminar series.
