Machine Learning Applications / Challenges in Natural Language Parsing
- Speaker: Ted Briscoe, Computer Laboratory
- Date & Time: Thursday 15 November 2007, 16:00 - 18:00
- Venue: LR4, Engineering, Department of
Abstract
A decade or so ago, the consensus was that full syntactic parsing (i.e. recovering all the grammatical relations between words in a sentence) was too brittle to be viable. Data-driven approaches building on large treebanks have changed this, and today full parsers are being deployed in applications such as information extraction.
I’ll describe the parsing task, a standard intrinsic evaluation scheme, and two state-of-the-art contenders: our RASP system and Clark and Curran’s CCG parser. The latter relies heavily on fully supervised training to estimate both configurational and (bi)lexical parameters for resolving syntactic ambiguity. This makes it more accurate on in-domain test data (financial news) but harder to port to a new domain (e.g. biomedical scientific papers). I’ll then describe recent work on semi-supervised training / bootstrapping of RASP, which relies much less on large in-domain treebanks, and the resulting applications and challenges for machine learning.
To acquire configurational parameter estimates for RASP, we used self-training over partially bracketed input, bootstrapping an initial model from the unambiguous portion of the data and then using this to weight counts from the ambiguous data. To acquire lexical subclasses, we use unlexicalized RASP to parse data and then subclassify words according to the contexts in which they occur. These subclassifications (e.g. of verbs into (in|di)transitive uses) are used to estimate parameters like P(subclass_i | verb_j), and these are then integrated into parse ranking.
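The bootstrapping step described above can be illustrated with a minimal Python sketch. It is not RASP's actual training procedure: the encoding of an analysis as a list of rule names, the function name `bootstrap_counts`, the relative-frequency seed model and the `floor` value for unseen rules are all assumptions made for illustration.

```python
from collections import Counter
from math import prod


def bootstrap_counts(sentences, floor=1e-3):
    """Return rule counts after one round of self-training.

    `sentences` is a list of sentences. Each sentence is a list of candidate
    analyses, and each analysis is a list of rule names (a hypothetical
    encoding chosen for this sketch).
    """
    # Step 1: seed the counts from sentences that have exactly one analysis.
    seed = Counter()
    for candidates in sentences:
        if len(candidates) == 1:
            seed.update(candidates[0])
    total = sum(seed.values())

    def prob(rule):
        # Relative frequency under the seed model; unseen rules get `floor`.
        return max(seed[rule] / total, floor) if total else floor

    # Step 2: add fractional counts from ambiguous sentences, weighting each
    # candidate analysis by its normalised score under the seed model.
    counts = Counter(seed)
    for candidates in sentences:
        if len(candidates) > 1:
            scores = [prod(prob(r) for r in analysis) for analysis in candidates]
            norm = sum(scores)
            for analysis, score in zip(candidates, scores):
                for rule in analysis:
                    counts[rule] += score / norm
    return counts
```

Under these assumptions, each ambiguous sentence contributes a total weight of one, spread across its candidate analyses in proportion to how plausible the seed model finds them. Conditional estimates such as P(subclass_i | verb_j) could be read off analogous counts of (verb, subclass) pairs.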
If there is time, I’ll talk about possible extensions of this work. Most parsers output a directed graph in which each node is labelled with a word token and each edge is labelled with a grammatical relation. RASP can also output a weighted directed graph of all relations hypothesised by the N best parses. To acquire bilexical collocational information to rank parses or to extract nuggets of information from documents, we would like to develop domain appropriate and efficient methods to compute (sub)graph (dis)similarity.
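As a rough illustration of the graph representation and the similarity question raised above, the following Python sketch builds a weighted relation graph from an N-best list and compares two such graphs. The edge weighting (summed normalised parse scores), the use of weighted Jaccard similarity, and the names `relation_graph` and `weighted_jaccard` are assumptions for this sketch; the abstract does not say how RASP weights relations or which similarity measure would be appropriate.

```python
from collections import defaultdict


def relation_graph(nbest):
    """Build a weighted relation graph from an N-best list.

    `nbest` is a list of (score, relations) pairs, where `relations` is a set
    of (head, label, dependent) triples and `score` is a positive parse score.
    Words are plain strings here; a real system would index token positions.
    """
    total = sum(score for score, _ in nbest)
    graph = defaultdict(float)
    for score, relations in nbest:
        for edge in set(relations):
            # An edge's weight is the share of score mass of parses containing it.
            graph[edge] += score / total
    return dict(graph)


def weighted_jaccard(g1, g2):
    """Similarity in [0, 1] between two weighted edge sets."""
    edges = set(g1) | set(g2)
    num = sum(min(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    den = sum(max(g1.get(e, 0.0), g2.get(e, 0.0)) for e in edges)
    return num / den if den else 1.0
```

A relation that appears in every one of the N best parses receives weight 1, while relations found only in low-scoring parses receive small weights, so comparisons between graphs are dominated by the relations the parser is most confident about.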
Briscoe, E.J., J. Carroll and R. Watson (2006) The Second Release of the RASP System, acl.ldc.upenn.edu/P/P06/P06-4020.pdf
Clark, S. and J. Curran (2007) Formalism-Independent Parser Evaluation with CCG and DepBank, acl.ldc.upenn.edu/P/P07/P07-1032.pdf
Series
This talk is part of the Machine Learning @ CUED series.