Adapting a WSJ-trained Lexicalized-Grammar Parser to New Domains

If you have a question about this talk, please contact Johanna Geiss.

In this talk I will describe some experiments on adapting the C&C CCG parser to new domains. The parser was originally developed using CCGbank, the CCG version of the Penn Treebank, and is therefore tuned to newspaper text. The two new domains we consider are (1) biomedical abstracts and (2) questions for a QA system (using the term “domain” somewhat loosely in the latter case).

The porting approach we use is to train the parser at lower levels of representation than full syntactic derivations. The lexicalized nature of CCG (in which words are assigned syntactic categories that include subcategorization information) makes it possible to use a level of representation intermediate between POS tags and full derivations. For the biomedical data, we find that simply retraining the POS tagger leads to a large improvement in performance, and that using annotated data at the intermediate CCG lexical category level improves parsing accuracy further. A similar result is obtained for the question data, but the impact of retraining at the CCG lexical category level is much greater. We suggest that this is because the syntax of questions differs more from that of newspaper text than does the syntax of biomedical sentences, and we discuss some measures supporting this idea.
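To illustrate why CCG lexical categories sit between POS tags and full derivations, here is a minimal sketch that reads the arguments off a category string. It is not part of the C&C parser: the helper names (`strip_parens`, `split_top_level`, `arguments`) and the two example words are assumptions made for illustration, using CCGbank-style category notation.

```python
def strip_parens(cat):
    """Remove one pair of outer parentheses if they wrap the whole category."""
    if cat.startswith("(") and cat.endswith(")"):
        depth = 0
        for i, ch in enumerate(cat):
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                # The first "(" closed before the end, so it does not wrap everything.
                if depth == 0 and i < len(cat) - 1:
                    return cat
        return cat[1:-1]
    return cat


def split_top_level(cat):
    """Split at the outermost slash: (result, slash, argument), or None if atomic."""
    depth = 0
    # Scan right to left so an unbracketed X/Y/Z is read as (X/Y)/Z.
    for i in range(len(cat) - 1, -1, -1):
        ch = cat[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return strip_parens(cat[:i]), ch, strip_parens(cat[i + 1:])
    return None


def arguments(cat):
    """List the (slash, argument) pairs a category expects, outermost first."""
    args = []
    cat = strip_parens(cat)
    while True:
        parts = split_top_level(cat)
        if parts is None:
            return args
        cat, slash, arg = parts
        args.append((slash, arg))


# Hypothetical lexicon entries: same POS tag, different lexical categories.
lexicon = {
    "sleeps": ("VBZ", "S[dcl]\\NP"),         # intransitive verb
    "chases": ("VBZ", "(S[dcl]\\NP)/NP"),    # transitive verb
}

for word, (pos, category) in lexicon.items():
    print(word, pos, category, arguments(category))
```

Both verbs share the POS tag `VBZ`, but the category for "chases" records an extra object argument to its right (`/NP`). That subcategorization information is what annotation at the lexical category level supplies, without requiring a full derivation for the sentence.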

The parsing accuracies obtained for both biomedical and question data are in the same range as those reported for newspaper text, and higher than those previously reported for the biomedical domain on the same evaluation resource. The conclusion is that porting newspaper-trained parsers to new domains may not be as difficult as first thought (at least for parsers which use lexicalized grammars), but we note that different levels of representation may have different impacts on the porting process, depending on the characteristics of the target domain.

This talk is part of the NLIP Seminar Series.
