BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Talks.cam//talks.cam.ac.uk//
X-WR-CALNAME:Talks.cam
BEGIN:VEVENT
SUMMARY:Learning Bigrams from Unigrams - Andrew B. Goldberg (University of
  Wisconsin\, Madison)
DTSTART:20080923T130000Z
DTEND:20080923T140000Z
UID:TALK13140@talks.cam.ac.uk
CONTACT:Zoubin Ghahramani
DESCRIPTION:Traditional wisdom holds that once documents are turned into b
 ag-of-words (unigram count) vectors\, word orders are completely lost. We 
 introduce an approach that\, perhaps surprisingly\, is able to learn a big
 ram language model from a set of bag-of-words documents. At its heart\, o
 ur approach is an EM algorithm that seeks a model which maximizes the regu
 larized marginal likelihood of the bag-of-words documents. In experiments 
 on seven corpora\, we observed that our learned bigram language models: \n
 # achieve better test set perplexity than unigram models trained on the sa
 me bag-of-words documents\, and are not far behind “oracle bigram models
 ” trained on the corresponding ordered documents\; \n# assign higher pro
 babilities to sensible bigram word pairs\; and\n# improve the accuracy of
  ordered document recovery from a bag-of-words. \nOur approach opens the
  door to novel phenomena\, for example\, privacy leakage from index files.\
 n\nThis work was originally presented at ACL 2008\, and is in collaboratio
 n with Xiaojin "Jerry" Zhu (UW)\, Michael Rabbat (McGill)\, and Rob Nowak 
 (UW).\n\nBio: Andrew Goldberg is a 4th-year PhD student at UW-Madison. He
  is broadly interested in statistical machine learning and natural language
  processing. He specializes in semi-supervised learning and is also part o
 f a "text-to-picture synthesis" project that combines machine learning\, N
 LP\, and computer vision for aiding communication.
LOCATION:Engineering Department\, CBL Room 438
END:VEVENT
END:VCALENDAR
