Deep Web Data: Analysis, Extraction, and Modelling
- 👤 Speaker: Prof. Pierre Senellart - Télécom ParisTech
- 📅 Date & Time: Monday 26 April 2010, 11:00 - 12:00
- 📍 Venue: Lecture-room large (126 seats) Microsoft Research Ltd, Roger Needham Building, 7 J J Thomson Avenue (Off Madingley Road), CB3 0FB
Abstract
Abstract: The traditional way for Web search engines to retrieve and index data from the Web has been to crawl its hyperlink structure. This approach cannot capture data of the deep Web (also known as the hidden Web or invisible Web), the huge amount of content available on the Web that lies behind Web forms or Web services. This talk discusses automatic and unsupervised methods for analyzing, extracting, and modelling Web data, given some initial domain of interest. Particular emphasis will be placed on applied and theoretical open problems whose solution would be of great help in understanding data of the deep Web. We first introduce classical methods for matching Web forms with concepts from an ontology, and investigate how static analysis of JavaScript programs could be used to improve the understanding of an HTML form. We next present an unsupervised approach to information extraction over deep Web result pages and highlight its limitations, stressing in particular the need for a probabilistic representation of the extracted data. This leads us to consider models for probabilistic trees. After a quick survey of the literature on probabilistic XML, we will discuss interesting questions about verification, in particular connecting the notion of probabilistic database with that of probabilistic schema.
Biography: Dr. Pierre Senellart is an Associate Professor in the Computer Science and Networking department at Télécom ParisTech, the leading French engineering school specialized in information technology. He is an alumnus of the École normale supérieure and obtained his M.Sc. (2003) and his Ph.D. (2007) in Computer Science from Université Paris-Sud, studying under the supervision of Serge Abiteboul. Pierre Senellart has published articles in internationally renowned conferences and journals (PODS, AAAI, VLDB Journal, Journal of the ACM, etc.). He has been a member of the program committees of ECML/PKDD, WWW, VLDB, and ICDE, a member of the repeatability committee of SIGMOD, and the organizer of the SIGMOD 2010 programming contest. He is also the Information Director of the Journal of the ACM. His research interests focus on theoretical aspects of database management systems and the World Wide Web, and more specifically on the intentional indexing of the deep Web, probabilistic XML databases, and graph mining. He also has an interest in natural language processing, and has been collaborating with SYSTRAN, the leading machine translation company.
Series
This talk is part of the Microsoft Research Cambridge, public talks series.
Included in Lists
- All Talks (aka the CURE list)
- bld31
- Cambridge Centre for Data-Driven Discovery (C2D3)
- Cambridge talks
- Chris Davis' list
- Guy Emerson's list
- Interested Talks
- Lecture-room large (126 seats) Microsoft Research Ltd, Roger Needham Building, 7 J J Thomson Avenue (Off Madingley Road), CB3 0FB
- Microsoft Research Cambridge, public talks
- ndk22's list
- ob366-ai4er
- Optics for the Cloud
- personal list
- PMRFPS's
- rp587
- School of Technology
- Trust & Technology Initiative - interesting events
- yk449
Note: Ex-directory lists are not shown.