BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Modeling text with Dirichlet compound multinomial distributions - 
 Charles Elkan\, UCSD
DTSTART:20070124T140000Z
DTEND:20070124T150000Z
UID:TALK6181@talks.cam.ac.uk
CONTACT:Christian Steinruecken
DESCRIPTION:The Dirichlet compound multinomial (DCM) distribution\, also c
 alled the multivariate Polya distribution\, is a generative model for docu
 ments that takes into account burstiness: the fact that if a word occurs o
 nce in a document\, it is likely to occur repeatedly.  I will present a ne
 w family of distributions that are close approximations to DCMs and that a
 re an exponential family\, unlike DCMs.  These so-called EDCM distribution
 s give insight into DCM properties\, and lead to an algorithm for EDCM max
 imum-likelihood training that is 100x faster than the corresponding DCM me
 thod.  Next\, I will discuss expectation-maximization with EDCM components
  and deterministic annealing as a new clustering algorithm for documents. 
 This algorithm is competitive with the best known methods\, and superior f
 rom the point of view of finding models with low perplexity.  Finally\, I
  will explain the Fisher kernel induced by DCMs and its connection with th
 e well-known TF-IDF heuristic for information retrieval.\n
LOCATION:TCM Seminar Room\, Cavendish Laboratory\, Department of Physics
END:VEVENT
END:VCALENDAR
