Modeling text with Dirichlet compound multinomial distributions
- Speaker: Charles Elkan, UCSD
- Date & Time: Wednesday 24 January 2007, 14:00 - 15:00
- Venue: TCM Seminar Room, Cavendish Laboratory, Department of Physics
Abstract
The Dirichlet compound multinomial (DCM) distribution, also called the multivariate Polya distribution, is a generative model for documents that takes into account burstiness: the fact that if a word occurs once in a document, it is likely to occur repeatedly. I will present a new family of distributions that are close approximations to DCMs and that are an exponential family, unlike DCMs. These so-called EDCM distributions give insight into DCM properties, and lead to an algorithm for EDCM maximum-likelihood training that is 100x faster than the corresponding DCM method. Next, I will discuss expectation-maximization with EDCM components and deterministic annealing as a new clustering algorithm for documents. This algorithm is competitive with the best known methods, and superior from the point of view of finding models with low perplexity. Finally, I will explain the Fisher kernel induced by DCMs and its connection with the well-known TF-IDF heuristic for information retrieval.
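As an illustration of the burstiness property mentioned in the abstract, the following is a minimal sketch of the standard DCM (multivariate Polya) log-probability, compared with a plain multinomial that has the same mean word probabilities. It is not taken from the talk: the function names, the four-word vocabulary, and the parameter values (`alpha = 0.1` per word) are assumptions chosen only for demonstration.

```python
from math import lgamma, log


def log_multinomial_coef(counts):
    """Log of n! / prod(x_w!), the number of orderings of the word counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)


def dcm_log_prob(counts, alpha):
    """Log-probability of a count vector under a DCM with parameters alpha.

    p(x | alpha) = n!/prod(x_w!) * Gamma(A)/Gamma(A + n)
                   * prod(Gamma(x_w + alpha_w)/Gamma(alpha_w)),
    where A = sum(alpha) and n = sum(x).
    """
    n = sum(counts)
    a_total = sum(alpha)
    log_norm = lgamma(a_total) - lgamma(a_total + n)
    log_terms = sum(lgamma(x + a) - lgamma(a) for x, a in zip(counts, alpha))
    return log_multinomial_coef(counts) + log_norm + log_terms


def multinomial_log_prob(counts, probs):
    """Log-probability of a count vector under a plain multinomial."""
    return log_multinomial_coef(counts) + sum(
        x * log(p) for x, p in zip(counts, probs) if x > 0
    )


# Hypothetical toy setting: 4-word vocabulary, documents of length 4.
alpha = [0.1, 0.1, 0.1, 0.1]          # small alpha -> strongly bursty DCM
probs = [a / sum(alpha) for a in alpha]  # multinomial with the same mean

bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # every word used once

print("DCM:        ", dcm_log_prob(bursty, alpha), dcm_log_prob(spread, alpha))
print("multinomial:", multinomial_log_prob(bursty, probs),
      multinomial_log_prob(spread, probs))
```

Under these assumed parameters the DCM assigns the repeated-word document a higher probability than the evenly spread one, while the multinomial ranks them the other way round, which is the sense in which the DCM accounts for burstiness.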
Series
This talk is part of the Inference Group series.
Included in Lists
- All Cavendish Laboratory Seminars
- All Talks (aka the CURE list)
- Biology
- Cambridge Neuroscience Seminars
- Cambridge talks
- Centre for Health Leadership and Enterprise
- Chris Davis' list
- dh539
- Featured lists
- Guy Emerson's list
- Hanchen DaDaDash
- Inference Group
- Inference Group Summary
- Interested Talks
- Joint Machine Learning Seminars
- Life Science
- Life Sciences
- Machine Learning Summary
- ME Seminar
- ML
- Neurons, Fake News, DNA and your iPhone: The Mathematics of Information
- Neuroscience
- Neuroscience Seminars
- Required lists for MLG
- rp587
- School of Physical Sciences
- Stem Cells & Regenerative Medicine
- TCM Seminar Room, Cavendish Laboratory, Department of Physics
- Thin Film Magnetic Talks
- yk373's list
Note: Ex-directory lists are not shown.