University of Cambridge > Talks.cam > Signal Processing and Communications Lab Seminars > Data sketching for cardinality and entropy estimation

Data sketching for cardinality and entropy estimation

Add to your list(s) Download to your calendar using vCal

If you have a question about this talk, please contact Rachel Fogg.

Streaming data is ubiquitous in a wide range of areas from engineering and information technology, finance, and commerce, to atmospheric physics, and earth sciences. The online approximation of properties of data streams is of great interest, but this approximation process is hindered by the sheer size of the data and the speed at which it is generated. Data stream algorithms typically allow only one pass over the data, and maintain sub-linear representations of the data from which target properties can be inferred with high efficiency. In this talk we consider the online approximation of two important characterizations of data streams: cardinality and empirical Shannon entropy. We assume that the number of distinct elements observed in the stream is prohibitively large, so that the vector of cumulative quantities cannot be stored on main computer memory for fast and efficient access. We focus on two techniques that use pseudo-random variates to form low-dimensional data sketches (using hashing and random projections), and derive estimators of the cardinality and empirical entropy. We discuss various properties of our estimators such as relative asymptotic efficiency, recursive computability, and error and complexity bounds. Finally, we present results on simulated data and seismic measurements from a volcano.

References: 1. Peter Clifford and Ioana A. Cosma (2011) “A statistical analysis of probabilistic counting algorithms” (to appear in the Scandinavian Journal of Statistics, preprint on arXiv:0801.3552). 2. Peter Clifford and Ioana A. Cosma (2009)“A simple sketching algorithm for entropy estimation” (in preparation, preprint on arXiv:0908.3961).

This talk is part of the Signal Processing and Communications Lab Seminars series.

Tell a friend about this talk:

This talk is included in these lists:

Note that ex-directory lists are not shown.

 

© 2006-2024 Talks.cam, University of Cambridge. Contact Us | Help and Documentation | Privacy and Publicity