Machine Learning Applications / Challenges in Natural Language Parsing

If you have a question about this talk, please contact Zoubin Ghahramani.

Machine Learning Tutorial Lecture

A decade or so ago, the consensus was that full syntactic parsing (i.e. recovering all the grammatical relations between words in a sentence) was too brittle to be viable. Data-driven approaches building on large treebanks have changed this, and today full parsers are being deployed in applications such as information extraction.

I’ll describe the parsing task, a standard intrinsic evaluation scheme, and two state-of-the-art contenders: our RASP system and Clark and Curran’s CCG parser. The latter relies heavily on fully supervised training to estimate both configurational and (bi)lexical parameters for resolving syntactic ambiguity, which makes it more accurate on in-domain test data (financial news) but harder to port to a new domain (e.g. biomedical scientific papers). I’ll then describe recent work on semi-supervised training (bootstrapping) of RASP, which relies far less on large in-domain treebanks, and the resulting applications and challenges for machine learning.

To acquire configurational parameter estimates for RASP, we used self-training over partially bracketed input, bootstrapping an initial model from the unambiguous portion of the data and then using this model to weight counts from the ambiguous data. To acquire lexical subclasses, we use unlexicalized RASP to parse data and then subclassify words according to the contexts in which they occur. These subclassifications (e.g. of verbs into (in|di)transitive uses) are used to estimate parameters such as P(subclass_i | verb_j), which are then integrated into parse ranking.
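As a rough illustration of the last step, parameters of the form P(subclass_i | verb_j) can be estimated by relative frequency over (verb, subclass) observations read off automatically parsed text. The sketch below is an assumption for illustration only, not RASP's actual estimator: the function name `estimate_subclass_probs`, the optional add-alpha smoothing, and the toy observations are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_subclass_probs(observations, subclasses, alpha=0.0):
    """Estimate P(subclass | verb) by relative frequency.

    observations: iterable of (verb, subclass) pairs, e.g. collected from
                  the output of an unlexicalized parser (hypothetical input).
    subclasses:   the closed set of subclass labels.
    alpha:        add-alpha smoothing constant (an assumption of this sketch;
                  0.0 gives plain maximum-likelihood estimates).
    """
    counts = defaultdict(Counter)
    for verb, subclass in observations:
        counts[verb][subclass] += 1

    probs = {}
    for verb, verb_counts in counts.items():
        total = sum(verb_counts.values()) + alpha * len(subclasses)
        probs[verb] = {s: (verb_counts[s] + alpha) / total for s in subclasses}
    return probs


# Toy data, invented for illustration.
SUBCLASSES = ("intransitive", "transitive", "ditransitive")
observations = [
    ("give", "ditransitive"),
    ("give", "ditransitive"),
    ("give", "transitive"),
    ("sleep", "intransitive"),
]

probs = estimate_subclass_probs(observations, SUBCLASSES)
for verb, dist in probs.items():
    print(verb, dist)
```

In a parse ranker, such estimates could enter as one factor in the score of any analysis that assigns a given subclass to a given verb; how RASP combines them with its configurational model is not specified here.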

If there is time, I’ll talk about possible extensions of this work. Most parsers output a directed graph in which each node is labelled with a word token and each edge is labelled with a grammatical relation. RASP can also output a weighted directed graph of all relations hypothesised by the N best parses. To acquire bilexical collocational information to rank parses or to extract nuggets of information from documents, we would like to develop domain-appropriate and efficient methods to compute (sub)graph (dis)similarity.
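To make the graph representation concrete, here is a minimal sketch of how a weighted relation graph might be assembled from an N-best list and how two such graphs could be compared. Everything in it is an assumption for illustration: the weighting scheme (each edge receives the normalized score mass of the parses that contain it), the choice of weighted Jaccard as a similarity measure, the simplified relation labels, and the toy parses. It is not a description of RASP's own output format or weighting.

```python
from collections import defaultdict


def relation_graph(nbest):
    """Build a weighted directed graph from an N-best list.

    nbest: list of (score, edges) pairs, where score is a positive parse
           score and edges is a set of (head, relation, dependent) triples.
    Returns a dict mapping each edge to the share of the total score mass
    carried by parses that contain it (an assumed weighting scheme).
    """
    total = sum(score for score, _ in nbest)
    graph = defaultdict(float)
    for score, edges in nbest:
        for edge in set(edges):
            graph[edge] += score / total
    return dict(graph)


def weighted_jaccard(g1, g2):
    """Weighted Jaccard similarity between two edge-weight dicts."""
    edges = set(g1) | set(g2)
    numerator = sum(min(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    denominator = sum(max(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    return numerator / denominator if denominator else 1.0


# Toy 2-best list for "I saw the man with the telescope" (invented scores;
# relation labels are simplified for illustration).
nbest = [
    (0.6, {("saw", "ncsubj", "I"), ("saw", "dobj", "man"), ("saw", "ncmod", "with")}),
    (0.4, {("saw", "ncsubj", "I"), ("saw", "dobj", "man"), ("man", "ncmod", "with")}),
]

graph = relation_graph(nbest)
best_only = relation_graph(nbest[:1])
print(graph)
print(weighted_jaccard(graph, best_only))
```

Relations shared by every parse end up with weight close to 1, while relations that depend on an ambiguous attachment split the mass between competing edges. Weighted Jaccard is only one simple whole-graph measure; the (sub)graph (dis)similarity methods the talk calls for would need to be more domain-sensitive and efficient than this.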

Briscoe, E.J., J. Carroll and R. Watson (2006) The Second Release of the RASP System, acl.ldc.upenn.edu/P/P06/P06-4020.pdf

Clark, S. and J. Curran (2007) Formalism-Independent Parser Evaluation with CCG and DepBank, acl.ldc.upenn.edu/P/P07/P07-1032.pdf

This talk is part of the Machine Learning @ CUED series.
