Better Together: Large Monolingual, Bilingual and Multimodal Corpora in NLP

Shane Bergsma - Johns Hopkins University
Monday 10 October 2011, 12:00-13:00
SW01, Computer Laboratory.

If you have a question about this talk, please contact Thomas Lippincott.

In this talk, I contrast NLP systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words and (3) is potentially trillions of words. I focus on the problem of finding the syntactic structure of complex noun phrases. The unannotated monolingual data is helpful when ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when ambiguity can be resolved by the order of words in the translation. I show how to iteratively improve the performance of a noun phrase parser by co-training over the monolingual and bilingual feature views. The co-trained system achieves state-of-the-art results (both within and across domains) starting from only a handful of labeled examples. I also describe NLP systems that successfully exploit the huge volume of labeled images on the web. If a picture’s worth a thousand words, then online visual data might comprise the biggest linguistic corpus of all.

This talk is part of the NLIP Seminar Series.
