Learning Bigrams from Unigrams
- 👤 Speaker: Andrew B. Goldberg (University of Wisconsin, Madison)
- 📅 Date & Time: Tuesday 23 September 2008, 14:00 - 15:00
- 📍 Venue: Engineering Department, CBL Room 438
Abstract
Traditional wisdom holds that once documents are turned into bag-of-words (unigram count) vectors, word orders are completely lost. We introduce an approach that, perhaps surprisingly, is able to learn a bigram language model from a set of bag-of-words documents. At its heart, our approach is an EM algorithm that seeks a model which maximizes the regularized marginal likelihood of the bag-of-words documents. In experiments on seven corpora, we observed that our learned bigram language models:
- achieve better test set perplexity than unigram models trained on the same bag-of-words documents, and are not far behind “oracle bigram models” trained on the corresponding ordered documents;
- assign higher probabilities to sensible bigram word pairs; and
- improve the accuracy of ordered document recovery from a bag-of-words.
Our approach opens the door to novel phenomena, for example, privacy leakage from index files.
This work was originally presented at ACL 2008, and is in collaboration with Xiaojin “Jerry” Zhu (UW), Michael Rabbat (McGill), and Rob Nowak (UW).
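For readers unfamiliar with the quantity named in the abstract, the marginal likelihood of a bag-of-words document under a bigram model can be thought of as the total probability of every ordering consistent with the bag. The following is a minimal brute-force sketch of that idea, not the method presented in the talk: the toy probabilities, the `START` symbol, and the function names are all illustrative assumptions, there is no EM step or regularization, and enumerating orderings is only feasible for very short documents.

```python
from itertools import permutations

START = "<s>"  # hypothetical start-of-document symbol (an assumption)


def sequence_prob(words, bigram):
    """Probability of one ordered document under a bigram model."""
    prob = 1.0
    prev = START
    for word in words:
        # Unseen word pairs get probability 0 in this toy model.
        prob *= bigram.get((prev, word), 0.0)
        prev = word
    return prob


def distinct_orderings(bag):
    """All distinct orderings of a bag given as {word: count}."""
    tokens = [word for word, count in bag.items() for _ in range(count)]
    return set(permutations(tokens))


def bag_marginal(bag, bigram):
    """Marginal probability of a bag: sum over its distinct orderings."""
    return sum(sequence_prob(o, bigram) for o in distinct_orderings(bag))


def best_ordering(bag, bigram):
    """Most probable ordering of the bag under the bigram model."""
    return max(distinct_orderings(bag), key=lambda o: sequence_prob(o, bigram))


# Toy bigram model P(next | previous); the values are made up.
toy_bigram = {
    (START, "new"): 0.7, (START, "york"): 0.3,
    ("new", "york"): 0.9, ("new", "new"): 0.1,
    ("york", "new"): 0.2, ("york", "york"): 0.8,
}

bag = {"new": 1, "york": 1}
marginal = bag_marginal(bag, toy_bigram)
recovered = best_ordering(bag, toy_bigram)
```

With these toy numbers the two orderings contribute 0.7 × 0.9 = 0.63 and 0.3 × 0.2 = 0.06, so the marginal is about 0.69, and `best_ordering` returns `("new", "york")`, a small-scale analogue of recovering an ordered document from its bag-of-words.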
Bio: Andrew Goldberg is a 4th year PhD student at UW-Madison. He is broadly interested in statistical machine learning and natural language processing. He specializes in semi-supervised learning and is also part of a “text-to-picture synthesis” project that combines machine learning, NLP, and computer vision for aiding communication.
Series
This talk is part of the Machine Learning @ CUED series.