A Polya Urn Document Language Model for Information Retrieval

If you have a question about this talk, please contact Tamara Polajnar.

Although the multinomial language model has been one of the most effective unigram models for information retrieval for over a decade, it fails to model one important linguistic phenomenon relating to term dependency: the tendency of a term to repeat itself within a document (i.e. word burstiness).

In this talk I will begin with a brief review of language modelling as applied to information retrieval. I will then present some work near completion in which we model document generation as a random process with reinforcement (a multivariate Polya process) and develop a Dirichlet compound multinomial language model that captures word burstiness. I will show that the new reinforced language model can be computed as efficiently as current retrieval models and that it significantly outperforms the multinomial model on a number of standard effectiveness metrics. I will conclude by presenting an analysis of the retrieval method which shows that it adheres to what is called the “verbosity hypothesis”, and I will show that the method essentially combines the term and document event spaces, giving theoretical justification to tf-idf-type schemes.
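As a rough illustration of why a Dirichlet compound multinomial (DCM) captures burstiness where a multinomial does not, the sketch below compares the two on a toy two-term vocabulary. It uses the standard textbook DCM probability mass function, not the retrieval model presented in the talk, whose details are not given here. The function names, the toy counts, and the parameter values (`alphas = [0.5, 0.5]`, `probs = [0.5, 0.5]`) are all assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def _log_multinomial_coefficient(counts):
    """log( n! / prod_w x_w! ) for a vector of term counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)


def multinomial_log_prob(counts, probs):
    """Log-probability of the term counts under a multinomial with term probabilities `probs`."""
    return _log_multinomial_coefficient(counts) + sum(
        c * log(p) for c, p in zip(counts, probs) if c > 0
    )


def dcm_log_prob(counts, alphas):
    """Log-probability of the term counts under a DCM (multivariate Polya) with parameters `alphas`.

    Standard form: coeff * Gamma(A) / Gamma(A + n) * prod_w Gamma(alpha_w + x_w) / Gamma(alpha_w),
    where A = sum(alphas) and n = sum(counts).
    """
    n = sum(counts)
    a_total = sum(alphas)
    return (
        _log_multinomial_coefficient(counts)
        + lgamma(a_total)
        - lgamma(a_total + n)
        + sum(lgamma(a + c) - lgamma(a) for c, a in zip(counts, alphas))
    )


if __name__ == "__main__":
    # Hypothetical four-word documents over a two-term vocabulary.
    bursty = [4, 0]   # one term repeated four times
    spread = [2, 2]   # two terms used twice each
    probs = [0.5, 0.5]    # assumed multinomial parameters
    alphas = [0.5, 0.5]   # assumed DCM parameters (small values favour repetition)

    for name, doc in (("bursty", bursty), ("spread", spread)):
        print(name,
              "multinomial:", exp(multinomial_log_prob(doc, probs)),
              "DCM:", exp(dcm_log_prob(doc, alphas)))
```

With these assumed parameters the multinomial assigns the bursty document a probability of 0.0625 against 0.375 for the evenly spread one, whereas the DCM assigns roughly 0.273 and 0.141 respectively: once a term has been drawn, the urn is reinforced and the same term becomes more likely to be drawn again.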

This talk is part of the NLIP Seminar Series.


© 2006-2024 Talks.cam, University of Cambridge.