Detecting Text Reuse in Large Historical Corpora and Authorship Attribution of Premodern Documents

If you have a question about this talk, please contact Dimitri Kartsaklis.

This presentation covers two projects: a method for detecting text reuse that withstands extreme OCR noise, and real-world applications of machine learning to authorship attribution. Detecting text reuse in historical documents can shed light on questions such as how particular news stories spread or whether authors plagiarized one another. Finding these repeated passages can be hard, because the documents are generally OCR-transcribed and can contain noise so extreme that the text borders on unreadable. Authorship attribution is by no means a new field, yet machine learning has received only limited attention in real-world applications. The presentation highlights a case in which machine learning provides new information that contradicts older manual attributions, and presents a method for attributing a document with multiple possible authors using very little training data.

This talk is part of the Language Technology Lab Seminars series.

© 2006-2024 Talks.cam, University of Cambridge.