Making the World's Scientific Information (More) Organized, Accessible, and Usable
- Speaker: Ted Briscoe - University of Cambridge
- Date & Time: Friday 12 February 2010, 12:00 - 13:00
- Venue: SW01, Computer Laboratory
Abstract
Web portals like Google Scholar and ScienceDirect have revolutionized access to scientific information by making it possible to identify relevant papers via keyword search, and then to browse them on-line. However, as scientific information continues to grow exponentially, and as (e-)science embraces automation, keeping abreast of and exploiting the information in these papers effectively is becoming impossible.
I’ll describe a prototype scientific literature search and information extraction system, developed in collaboration with the FlyBase (Fruit Fly Genomics) curation team, designed to support very fine-grained but intuitive querying and access to information in a collection of papers. FlySearch indexes annotated papers and supports integrated search over individual sentences and images, aggregating information across the collection. For example, one can search captions describing a specific gene regulating a biological process and restrict the associated images to a specific body part.
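As a rough illustration of the kind of query in the example above, the following Python sketch filters annotated captions by gene and biological process, optionally restricts them to a body part, and then counts hits per paper. It is not FlySearch's implementation or API: the `Caption` record, the function names, and the placeholder values (`geneA`, `processX`, `paper1`, and so on) are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    """Hypothetical annotated figure caption; the fields are assumptions."""
    paper_id: str
    text: str
    genes: frozenset
    processes: frozenset
    body_parts: frozenset


def search_captions(captions, gene, process, body_part=None):
    """Return captions annotated with both `gene` and `process`.

    If `body_part` is given, keep only captions whose image annotation
    includes that body part.
    """
    hits = []
    for caption in captions:
        if gene not in caption.genes or process not in caption.processes:
            continue
        if body_part is not None and body_part not in caption.body_parts:
            continue
        hits.append(caption)
    return hits


def hits_per_paper(hits):
    """Aggregate matching captions across the collection, keyed by paper."""
    return Counter(caption.paper_id for caption in hits)


# Invented placeholder records, not real FlyBase data.
CAPTIONS = [
    Caption("paper1", "geneA regulates processX in the wing.",
            frozenset({"geneA"}), frozenset({"processX"}), frozenset({"wing"})),
    Caption("paper2", "geneA regulates processX in the eye.",
            frozenset({"geneA"}), frozenset({"processX"}), frozenset({"eye"})),
    Caption("paper2", "geneB regulates processX in the wing.",
            frozenset({"geneB"}), frozenset({"processX"}), frozenset({"wing"})),
]

wing_hits = search_captions(CAPTIONS, "geneA", "processX", body_part="wing")
```

Keeping the annotations as sets on each caption makes the body-part restriction a simple extra filter on top of the gene and process match, which is the shape of the query the abstract describes.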
The system rests on a processing pipeline in which a Portable Document Format paper is first converted to Scientific eXtensible Mark-up Language. The conversion preserves the paper's logical structure but, for example, separates images, tables, and references from running text. Specialized text and image processing tools are then applied to the different components of the paper. These tools can compute image similarity and recognize gene names, facts about genes, and their relationships to other biological entities, among other things. They have been designed to be as generic as possible to facilitate application to different areas of science. Where they require domain-specific tuning, they have been developed using semi-supervised machine learning methods to minimize the cost of that tuning.
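The component-wise processing described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system's actual code: the `Paper` record stands in for a paper that has already been converted to structured mark-up, gene recognition is replaced by a plain lexicon lookup (the real tools are described as machine-learned), and image similarity is shown as cosine similarity over hypothetical feature vectors. All names are invented for illustration.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical stand-in for a paper already split into components."""
    sentences: list
    captions: list
    references: list


def tag_genes(sentence, lexicon):
    """Return the lexicon gene names that occur as whole tokens in the sentence.

    A lexicon lookup is a deliberate simplification of gene name recognition.
    """
    tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence))
    return sorted(tokens & lexicon)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length image feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def process(paper, lexicon):
    """Apply the text tool to running text and captions separately.

    References are kept apart so they are not tagged as running text.
    """
    return {
        "sentence_genes": [(s, tag_genes(s, lexicon)) for s in paper.sentences],
        "caption_genes": [(c, tag_genes(c, lexicon)) for c in paper.captions],
        "references": list(paper.references),
    }
```

Keeping each component in its own field means a tool only ever sees the kind of content it was built for, which is the point of separating images, tables, and references from running text before processing.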
Initial results suggest that many aspects of the user interface need refinement, but the underlying search functionality significantly improves speed and precision over keyword-based document-level search. Nevertheless, many challenges remain. Perhaps the most pressing is handling more of the contextually mediated variant ways of expressing the same meaning. We would also like to go beyond finding and extracting relations between biological entities and, for example, support reasoning (e.g. temporal reasoning) about biological events.
Series
This talk is part of the NLIP Seminar Series.