Deep Web Crawling and Mining
![Page 1: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/1.jpg)
Deep Web Crawling and Mining
Presented by: Group 17
AIA 8803 Course, Feb 28, 2008
![Page 2: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/2.jpg)
What’s the Problem?
Large Amount of Deep Web Content
• The deep Web refers to World Wide Web content that is not part of the surface Web indexed by search engines (Bergman, 2001)
• In 2000, it was estimated that the deep Web contained approximately 7,500 terabytes of data and 550 billion individual documents
Characteristics of Deep Web Data
• Mostly generated by backend databases
• Intrinsic: hidden behind the database schema
![Page 3: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/3.jpg)
![Page 4: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/4.jpg)
Our solution
• Deep web crawling: iterative querying
• Deep web mining: attribute labeling, database construction
• Advanced search: object-level search, comparison
![Page 5: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/5.jpg)
Deep Web Crawling
Why is it difficult in the dynamic web space?
• Hidden Web, Deep Web
• Different from a traditional web crawler, where a hyperlink graph is traversed with BFS or DFS to crawl web pages
Seed-based crawler
• Seed → Crawl → New Seed → Crawl → …
![Page 6: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/6.jpg)
A Crawler Example
• Initial seed: car
• New seeds: Lincoln, Deluxe, TracRac, Truck, SUV
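The seed-based loop (seed, crawl, new seed, crawl) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `TOY_DATABASE` is a hypothetical stand-in for a site reachable only through a keyword search form, and `query_source` and `crawl` are hypothetical helper names, not part of the presented system.

```python
from collections import deque

# Hypothetical records that can only be reached by submitting a keyword query.
TOY_DATABASE = [
    "Lincoln Town Car Deluxe",
    "Lincoln Mark LT Truck",
    "TracRac Truck rack",
    "Lincoln Navigator SUV",
]

def query_source(keyword):
    """Simulate a search form: return records containing the keyword as a word."""
    kw = keyword.lower()
    return [r for r in TOY_DATABASE if kw in r.lower().split()]

def crawl(initial_seed, max_queries=50):
    """Iteratively query the source, reusing terms found in results as new seeds."""
    pending = deque([initial_seed.lower()])
    issued = set()      # keywords already submitted
    collected = set()   # records retrieved so far
    while pending and len(issued) < max_queries:
        seed = pending.popleft()
        if seed in issued:
            continue
        issued.add(seed)
        for record in query_source(seed):
            collected.add(record)
            # Every word in a result becomes a candidate seed.
            for term in record.lower().split():
                if term not in issued:
                    pending.append(term)
    return collected, issued

collected, issued = crawl("car")
```

Starting from the single seed "car", the loop discovers "lincoln" and "truck" as new seeds and eventually retrieves every toy record. A real crawler would also need seed selection heuristics and politeness limits, which this sketch omits.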
![Page 7: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/7.jpg)
Deep Web Mining
What we have:
• A large number of web pages gathered from the crawler
Machine Learning / Data Mining techniques
What we need:
• A structured database for web applications
![Page 8: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/8.jpg)
Deep Web Mining
Problem: Different web sites may have different layouts
![Page 9: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/9.jpg)
Deep Web Mining
Conditional Random Fields (CRFs)
• An undirected graphical model
• X (gray nodes): observations, i.e. features extracted from the crawled web pages
• Y (white nodes): hidden states, i.e. labels such as product name, price, customer rating, etc.
• A CRF models the conditional probability p(y|x)
• Key advantage: rich, correlated feature sets
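For reference, the conditional probability can be written out for the common linear-chain case. The slides do not state which CRF structure was used, so the chain form, the feature functions $f_k$, and the weights $\lambda_k$ below are assumptions for illustration.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Because each $f_k$ may inspect the whole observation sequence $x$, overlapping and correlated features (layout cues, token patterns, neighbouring text) can be used without modelling their joint distribution.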
![Page 10: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/10.jpg)
Web database from mining
• Data fusion will be necessary where multiple copies of data exist across sites
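One simple way to fuse duplicate records is sketched below. The slides do not say which fusion method was used; majority voting per attribute is a swapped-in assumption, and `fuse`, the field names, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict

def fuse(records, key="name"):
    """Merge records describing the same object; resolve conflicts by majority vote."""
    grouped = defaultdict(list)
    for rec in records:
        # Group by a normalised key so "Acme X100" and "acme x100" match.
        grouped[rec[key].strip().lower()].append(rec)
    fused = []
    for recs in grouped.values():
        merged = {}
        attributes = {attr for r in recs for attr in r}
        for attr in sorted(attributes):
            values = [r[attr] for r in recs if r.get(attr) is not None]
            if values:
                # Keep the value reported by the most sources.
                merged[attr] = Counter(values).most_common(1)[0][0]
        fused.append(merged)
    return fused

# Hypothetical records extracted from three different sites.
records = [
    {"name": "Acme X100", "price": 199.0, "rating": 4.5},
    {"name": "acme x100", "price": 199.0},
    {"name": "Acme X100", "price": 205.0, "rating": 4.5},
    {"name": "Other Z", "price": 10.0},
]
fused = fuse(records)
```

Majority voting ignores how reliable each site is; weighting sources by trust would be a natural refinement.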
![Page 11: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/11.jpg)
What We Have
• Web object extraction and mining
• Structured databases of web objects
Next Step
• Improve state-of-the-art Web search
• Make some money
![Page 12: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/12.jpg)
Building Advanced Web Search Application
1. Object-level web search: combine different features or attributes of an identical Web object from different Web sites to respond to a user query
• DBLP (manual but high precision)
• Citeseer (automatic but lower precision)
• Challenge: how to build a precise and automatic object-level search platform (DBLP?)
2. Comparison Web search: compare attributes (e.g. price, performance, etc.) of Web objects across different sites or sources
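A comparison query over the mined database might look like the following sketch. It uses Python's built-in `sqlite3` as a stand-in for the MySQL backend; the `offers` table, its columns, and the `cheapest_offers` helper are hypothetical names chosen for illustration.

```python
import sqlite3

def cheapest_offers(rows):
    """Return, for each product, the site(s) offering the lowest price."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE offers (product TEXT, site TEXT, price REAL)")
    conn.executemany("INSERT INTO offers VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT o.product, o.site, o.price
        FROM offers AS o
        JOIN (SELECT product, MIN(price) AS best
              FROM offers GROUP BY product) AS m
          ON o.product = m.product AND o.price = m.best
        ORDER BY o.product, o.site
        """
    ).fetchall()
    conn.close()
    return result

# Hypothetical offers for the same Web objects mined from different sites.
rows = [
    ("Acme X100", "siteA", 199.0),
    ("Acme X100", "siteB", 189.5),
    ("Other Z", "siteA", 10.0),
]
best = cheapest_offers(rows)
```

The join against a per-product minimum is plain SQL, so the same query shape should carry over to MySQL with little change.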
![Page 13: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/13.jpg)
Building a LAMP Server
"LAMP" system: Linux, Apache, MySQL and PHP.
1. low acquisition cost
2. ubiquity of its components
![Page 14: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/14.jpg)
Fancy restaurant (dynamic web server)
Apache: chef. PHP: waiter. MySQL: stockroom of ingredients.
When a patron (or Web site visitor) comes to your restaurant, he or she sits down and orders a meal with specific requirements. The waiter (PHP) takes those specific requirements back to the kitchen and passes them off to the chef (Apache). The chef then goes to the stockroom (MySQL) to retrieve the ingredients (or data) to prepare the meal and presents the final dish to the patron, exactly the way he or she ordered the meal.
![Page 15: Deep Web Crawling and Mining](https://reader035.fdocuments.in/reader035/viewer/2022081504/56814e7e550346895dbc1b67/html5/thumbnails/15.jpg)
Thank you.
Q&A