Web Data Extraction with
![Page 1: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/1.jpg)
CETIC a.s.b.l. • Bâtiment Eole • Rue des Frères Wright, 29/3 • B6041 Charleroi (Belgique) • T. +32 71 490 700 • F. +32 71 490 799 • [email protected]
Fabrice Estiévenart ([email protected])
Rencontres Mondiales du Logiciel Libre, Bordeaux, 8th July 2010
![Page 2: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/2.jpg)
CETIC, overview
• ICT research centre
• Created in 2001
• Initiated by 3 universities
• Connection between Industry & Research
• 3 departments, 40 researchers
• Contribution to Regional Economic Development
• International focus – European Research Area
![Page 3: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/3.jpg)
CETIC, the ICS team
• Knowledge Extraction from Unstructured Content
– Web wrapping
– Document clustering
– Text mining
• Search Engines
– Crawling
– Text extraction
– Analysis / Indexing
– Search
• Semantic Web
– Ontology and terminology engineering
– Microformats
![Page 4: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/4.jpg)
Context & motivations
Internet
• Data of interest are hidden within HTML
• HTML has weak semantics
• Web pages are constantly changing
Source: xkcd.com
![Page 5: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/5.jpg)
Biggest challenges in web wrapping
• Collecting/organizing relevant documents
– Intelligent crawling
– Web services
• Locating data of interest within documents
– RegExp
– Element structure
– External resources
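The difference between the RegExp and element-structure options for locating data can be illustrated with a minimal Python sketch. The page fragment, the function names, and the path expression are all hypothetical and are not taken from Retroweb. The standard-library `xml.etree.ElementTree` parser only accepts well-formed markup, so a real HTML page would usually need a more tolerant parser first.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical, well-formed page fragment (real pages are rarely this clean).
PAGE = """<html><body>
  <div id="header"><h1>The Soloist</h1></div>
  <div id="info"><span class="year">2009</span></div>
</body></html>"""


def locate_with_regexp(page: str) -> Optional[str]:
    """Locate the title by matching the raw text between <h1> tags."""
    match = re.search(r"<h1>(.*?)</h1>", page, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def locate_with_structure(page: str) -> Optional[str]:
    """Locate the title by walking the element tree with a path expression."""
    root = ET.fromstring(page)
    node = root.find(".//div[@id='header']/h1")
    if node is None or node.text is None:
        return None
    return node.text.strip()


title_by_regexp = locate_with_regexp(PAGE)
title_by_structure = locate_with_structure(PAGE)
```

The regular expression depends on the exact character sequence around the value, while the path expression depends on where the element sits in the tree. Each can break when the page changes, but for different reasons.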
![Page 6: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/6.jpg)
Retroweb, overview
Tool for web data extraction (web wrapping)
RetrowebGUI: semi-automated, visual definition of extraction rules
RetrowebWrapper: web data extraction from extraction rules
![Page 7: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/7.jpg)
Retroweb, terminology
Web Page: http://www.imdb.com/title/tt0821642, http://www.imdb.com/title/tt0858486, http://www.imdb.com/title/tt0458525
Page Type: imdbmovie
Page Component: title, tagline, actors
Component Value: The soloist
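One way to read these four terms is as a small data model. The sketch below is only an illustration: the class and method names are hypothetical and do not come from Retroweb, and pairing the first URL with the value "The soloist" is an assumption, since the slide lists them separately.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PageComponent:
    """A named piece of data on a page, e.g. 'title', with its extracted values."""
    name: str
    values: List[str] = field(default_factory=list)


@dataclass
class PageType:
    """A family of pages sharing the same layout, e.g. 'imdbmovie'."""
    name: str
    component_names: List[str]


@dataclass
class WebPage:
    """A concrete page (one URL) that belongs to a page type."""
    url: str
    page_type: PageType
    components: Dict[str, PageComponent] = field(default_factory=dict)

    def set_value(self, component_name: str, value: str) -> None:
        # Only components declared by the page type are accepted.
        if component_name not in self.page_type.component_names:
            raise KeyError(f"unknown component: {component_name}")
        component = self.components.setdefault(
            component_name, PageComponent(component_name)
        )
        component.values.append(value)


imdb_movie = PageType("imdbmovie", ["title", "tagline", "actors"])
page = WebPage("http://www.imdb.com/title/tt0821642", imdb_movie)
page.set_value("title", "The soloist")  # assumed pairing of URL and value
```

A component holds a list of values so that multi-valued components such as `actors` fit the same structure as single-valued ones such as `title`.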
![Page 8: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/8.jpg)
Retroweb, model
![Page 9: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/9.jpg)
The wrapping process
Retroweb GUI
• Select a page sample
• Select values and name page components
• Refine extraction rules
• (Re)structure page components
• Check component values
Retroweb Wrapper
• Extract to XML
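The final step, applying extraction rules to a page and writing the result as XML, might look roughly like the sketch below. It is not Retroweb's implementation. The rule format (a component name mapped to a path expression), the sample page, the `wrap` function, and the output element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical extraction rules: component name -> path expression.
RULES: Dict[str, str] = {
    "title": ".//h1",
    "tagline": ".//p[@class='tagline']",
    "actors": ".//li[@class='actor']",
}

# Hypothetical, well-formed sample page with placeholder values.
SAMPLE = """<html><body>
<h1>The soloist</h1>
<p class="tagline">Hypothetical tagline</p>
<ul><li class="actor">Actor A</li><li class="actor">Actor B</li></ul>
</body></html>"""


def wrap(page_source: str, page_type: str, rules: Dict[str, str]) -> ET.Element:
    """Apply every rule to the page and collect the values in an XML tree."""
    root = ET.fromstring(page_source)
    output = ET.Element("page", {"type": page_type})
    for name, path in rules.items():
        for node in root.findall(path):
            component = ET.SubElement(output, "component", {"name": name})
            # Concatenate all text inside the matched element.
            component.text = "".join(node.itertext()).strip()
    return output


result = wrap(SAMPLE, "imdbmovie", RULES)
xml_text = ET.tostring(result, encoding="unicode")
```

Because only the components named in the rules are emitted, the output contains the data of interest and nothing else from the page.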
![Page 10: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/10.jpg)
Screenshot
![Page 11: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/11.jpg)
Technologies
![Page 12: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/12.jpg)
Retroweb, benefits
• Easy: no need to learn a specific language
• Flexible: only data of interest are extracted
• Robust: extraction rules are defined from a set of pages
• Extensible: based on standards (XML, XPath) and open source (Affero GPL v3)
• Portable: GNU/Linux, MS Windows
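The robustness point, that rules are defined from a set of pages rather than a single one, can be illustrated by checking candidate rules against several samples. The sketch below assumes each sample page comes with the value a user selected on it. The pages, the candidate path expressions, and the `rule_holds` function are hypothetical, and this is not a description of how Retroweb derives its rules.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hypothetical samples: (page source, value the user selected on that page).
SAMPLES: List[Tuple[str, str]] = [
    ("<html><body><div><h1>Movie A</h1></div></body></html>", "Movie A"),
    ("<html><body><h1>Movie B</h1></body></html>", "Movie B"),
]

# Hypothetical candidate rules for the 'title' component.
CANDIDATES = ["./body/h1", "./body/div/h1", ".//h1"]


def rule_holds(path: str, samples: List[Tuple[str, str]]) -> bool:
    """A rule holds if it selects exactly the expected value on every sample."""
    for source, expected in samples:
        nodes = ET.fromstring(source).findall(path)
        if len(nodes) != 1 or (nodes[0].text or "").strip() != expected:
            return False
    return True


robust_rules = [path for path in CANDIDATES if rule_holds(path, SAMPLES)]
```

Each of the two absolute paths fits one sample only, so a rule written from a single page would fail on the other. Only the rule that survives both samples is kept.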
![Page 13: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/13.jpg)
Some applications
• Web site reverse engineering
• Search engines
• Competitive intelligence
• Semantic annotation of corpora
![Page 14: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/14.jpg)
Some similar applications
● Lixto Visual Developer
– Spin-off from the Vienna University of Technology (TU Wien)
– Services in web intelligence
– Visual tool based on Eclipse RCP
● Dapper
– Free web application for data extraction
– Not extensible
![Page 15: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/15.jpg)
Want to get involved?
• Centre d'Expertise en Logiciel Libre à Vocation Industrielle (CELLAVI)
• https://forge.pallavi.be
![Page 16: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/16.jpg)
Next steps
• Search engine integration
– Web page collection and clustering
– Semantic indexing
– Advanced search
• Error detection
• Rules maintenance
• Semantic import/export
– Microformats (RSS, FOAF, ...)
– Ontology population
![Page 17: Web Data Extraction with](https://reader035.fdocuments.in/reader035/viewer/2022071601/613d3bc2736caf36b75ae77c/html5/thumbnails/17.jpg)
Thanks for your attention
Questions?