Better Together: Large Monolingual, Bilingual and Multimodal Corpora in NLP
- Speaker: Shane Bergsma - Johns Hopkins University
- Date & Time: Monday 10 October 2011, 12:00 - 13:00
- Venue: SW01, Computer Laboratory
Abstract
In this talk, I contrast NLP systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. I focus on the problem of finding the syntactic structure of complex noun phrases. The unannotated monolingual data is helpful when ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when ambiguity can be resolved by the order of words in the translation. I show how to iteratively improve the performance of a noun phrase parser by co-training over the monolingual and bilingual feature views. The co-trained system achieves state-of-the-art results (both within and across domains) starting from only a handful of labeled examples. I also describe NLP systems that successfully exploit the huge volume of labeled images on the web. If a picture’s worth a thousand words, then online visual data might comprise the biggest linguistic corpus of all.
Series
This talk is part of the NLIP Seminar Series.
Included in Lists
- All Talks (aka the CURE list)
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge Forum of Science and Humanities
- Cambridge Language Sciences
- Cambridge talks
- Chris Davis' list
- Computer Education Research
- Computing Education Research
- Department of Computer Science and Technology talks and seminars
- Graduate-Seminars
- Guy Emerson's list
- Interested Talks
- Language Sciences for Graduate Students
- ndk22's list
- NLIP Seminar Series
- ob366-ai4er
- PMRFPS's
- rp587
- School of Technology
- Simon Baker's List
- SW01, Computer Laboratory
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.