Data Mining and Information Extraction for CiteSeerX and Friends
- Speaker: Dr. C. Lee Giles, Pennsylvania State University
- Date & Time: Friday 29 June 2012, 12:00 - 13:00
- Venue: SW01, Computer Laboratory
Abstract
Cyberinfrastructure, or e-science, has become crucial in many areas of science where data access often defines scientific progress. Open source (OS) systems have greatly facilitated the design and implementation of supporting cyberinfrastructure, permitting the design of specialized integrated search engines and digital libraries. These offer many opportunities for domain-relevant information and knowledge extraction, such as citation extraction, automated indexing and ranking, chemical formulae search, and table indexing. We describe the open source SeerSuite architecture, a modular, extensible system built on successful OS projects such as Lucene/Solr, and discuss issues in building domain-specific enterprise search and cyberinfrastructure for the sciences and academia. Because of the large amount of information crawled and/or searched, there are many problems of scale in information extraction and data mining, such as author and entity disambiguation and data extraction and ranking. We highlight application domains with examples from computer science (CiteSeerX), chemistry (ChemXSeer), and related problem areas. Because such enterprise systems require unique information extraction approaches, several different machine learning methods, such as conditional random fields, support vector machines, mutual-information-based feature selection, and sequence mining, are critical for performance. We draw lessons for other e-science and cyberinfrastructure systems in terms of design, implementation, and research, and discuss future directions, systems, and research.
Series
This talk is part of the NLIP Seminar Series.