Web Harvesting Ppt
Web Harvesting
RAHUL MADANU, 09BK1A0535, CSE-A
Introduction
• A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing.
• A web search engine is software code that is designed to search for information on the World Wide Web.
• A database is an organized collection of data.
Existing System
Q: How does a search engine know that all these pages contain the query terms?
A: Because all of those pages have been crawled.
Motivation for crawlers
• Support universal search engines (Google, Yahoo, MSN/Windows Live, Ask, etc.)
• Vertical (specialized) search engines, e.g. news, shopping, papers, recipes, reviews, etc.
• Business intelligence: keep track of potential competitors and partners
• Monitor Web sites of interest
• Evil: harvest emails for spamming, phishing, ...
• Can you think of some others?
Many names
• Crawler
• Spider
• Robot (or bot)
• Web agent
• Wanderer, worm, …
• And famous instances: googlebot, scooter, slurp, msnbot, …
A crawler within a search engine
[Diagram: the crawler (googlebot) fetches pages from the Web into a page repository; text & link analysis builds the text index and PageRank; a ranker answers queries with hits.]
Page Rank
Page Rank Probability
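The slide's figure did not survive this transcript; the standard PageRank formula it typically presents (with damping factor d, usually 0.85, N total pages, M(p_i) the set of pages linking to p_i, and L(p_j) the number of outlinks of p_j) is:

```latex
PR(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
```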
Proposed System
Aim:
• Set a higher memory range.
• Eliminate all "file not found" results.
• Remove the negative (stop-word) dictionary.
• Obtain the base URL.
• Use multi-processing or multi-threading.
Recovering Issues
• Don't fetch the same page twice; save visited pages in a marked list.
• Soft-fail on timeouts, unresponsive servers, "file not found", and other errors.
• Noise words that carry no meaning ("stop words") should be eliminated before they are indexed, e.g. in English: AND, THE, A, AT, OR, ON, FOR, etc.
• Need to obtain the base URL from the HTTP header:
  • Base: http://www.cnn.com/linkto/
  • Relative URL: intl.html
  • Absolute URL: http://www.cnn.com/linkto/intl.html
• Overlap the above delays by fetching many pages concurrently, via multi-processing or multi-threading.
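The base/relative/absolute URL example above can be reproduced with the JDK's own `java.net.URI` resolution; a minimal sketch:

```java
import java.net.URI;

public class ResolveUrl {
    public static void main(String[] args) {
        // Resolve the slide's relative URL against its base URL.
        URI base = URI.create("http://www.cnn.com/linkto/");
        URI absolute = base.resolve("intl.html");
        System.out.println(absolute); // http://www.cnn.com/linkto/intl.html
    }
}
```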
Policy
• Coverage
• Freshness
• Trade-off (Subjective)
Algorithm
• An algorithm for classifying crawled data into the database.
• Bayesian approaches are a fundamentally important data-mining technique. Given the probability distribution, the Bayes classifier can provably achieve the optimal result.
• The naive Bayes classifier assumes all attributes are independent of each other.
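A minimal naive Bayes sketch, with entirely hypothetical categories and training words (the slides do not show the actual features used): each page is assigned to the category with the highest posterior, treating word occurrences as independent given the class and using add-one smoothing.

```java
import java.util.*;

public class NaiveBayesSketch {
    static String classify(Map<String, List<String[]>> corpus, String[] doc) {
        // Vocabulary size is needed for add-one (Laplace) smoothing.
        Set<String> vocab = new HashSet<>();
        corpus.values().forEach(docs -> docs.forEach(d -> vocab.addAll(Arrays.asList(d))));

        int totalDocs = corpus.values().stream().mapToInt(List::size).sum();
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;

        for (Map.Entry<String, List<String[]>> entry : corpus.entrySet()) {
            Map<String, Integer> counts = new HashMap<>();
            int totalWords = 0;
            for (String[] d : entry.getValue()) {
                for (String w : d) { counts.merge(w, 1, Integer::sum); totalWords++; }
            }
            // log P(class) + sum of log P(word | class), smoothed
            double score = Math.log((double) entry.getValue().size() / totalDocs);
            for (String w : doc) {
                score += Math.log((counts.getOrDefault(w, 0) + 1.0) / (totalWords + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = entry.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny illustrative corpus: category -> tokenized training pages.
        Map<String, List<String[]>> corpus = new HashMap<>();
        corpus.put("sports", List.of(new String[]{"match", "goal", "team"},
                                     new String[]{"team", "win", "goal"}));
        corpus.put("tech",   List.of(new String[]{"crawler", "index", "web"},
                                     new String[]{"web", "search", "index"}));
        System.out.println(classify(corpus, new String[]{"web", "crawler", "goal"})); // tech
    }
}
```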
Basic crawlers
• This is a sequential crawler
• Seeds can be any list of starting URLs
• Order of page visits is determined by frontier data structure
• Stop criterion can be anything
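The loop described by these bullets can be sketched as follows; `fetchAndExtractLinks` is a hypothetical placeholder for the HTTP fetch and link-extraction step:

```java
import java.util.*;

public class SequentialCrawler {
    public static Set<String> crawl(List<String> seeds, int maxPages) {
        // The frontier data structure determines the order of page visits.
        Deque<String> frontier = new ArrayDeque<>(seeds); // FIFO here -> breadth-first
        Set<String> visited = new HashSet<>();
        // Stop criterion: empty frontier or a page budget reached.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // never fetch the same page twice
            for (String link : fetchAndExtractLinks(url)) {
                if (!visited.contains(link)) frontier.add(link);
            }
        }
        return visited;
    }

    // Placeholder for the real fetch + HTML link extraction step.
    static List<String> fetchAndExtractLinks(String url) {
        return List.of();
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://example.com/"), 100));
    }
}
```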
Graph traversal (BFS or DFS?)
• Breadth-First Search
  • Implemented with a QUEUE (FIFO)
  • Finds pages along shortest paths
  • If we start with "good" pages, this keeps us close; maybe other good stuff too
• Depth-First Search
  • Implemented with a STACK (LIFO)
  • May wander away ("lost in cyberspace")
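A `Deque` illustrates how the same frontier supports both disciplines; only the end you take from changes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FrontierOrder {
    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.addLast("a"); // discovered first
        frontier.addLast("b");
        frontier.addLast("c"); // discovered last

        String bfsNext = frontier.peekFirst(); // queue (FIFO): oldest URL, "a"
        String dfsNext = frontier.peekLast();  // stack (LIFO): newest URL, "c"
        System.out.println(bfsNext + " " + dfsNext); // a c
    }
}
```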
Breadth-first crawlers
• A BF crawler tends to crawl high-PageRank pages very early
• Therefore, a BF crawler is a good baseline to gauge other crawlers
[Chart: crawler comparison over the average number of pages crawled]
Crawler ethics and conflicts
• Crawlers can cause trouble, even unwillingly, if not properly designed to be "polite" and "ethical"
• Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
  http://foo.com/woo/foo/woo/foo/woo
• For example, sending too many requests in rapid succession to a single server can amount to a Denial of Service (DoS) attack!
  • The server administrator and users will be upset
  • The crawler developer/admin IP address may be blacklisted
Crawler etiquette (important!)
Spread the load, do not overwhelm a server
• Make sure that no more than some maximum number of requests go to any single server per unit time, say < 1/second
Honor the Robot Exclusion Protocol
• A server can specify which parts of its document tree any crawler is or is not allowed to crawl via a file named 'robots.txt' placed in the HTTP root directory, e.g. http://www.indiana.edu/robots.txt
• A crawler should always check, parse, and obey this file before sending any requests to a server
More info at:
• http://www.google.com/robots.txt
• http://www.robotstxt.org/wc/exclusion.html
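The "< 1/second per server" rule can be sketched as a per-host delay table; this is an illustrative minimum, real crawlers also honor robots.txt crawl-delay hints:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PolitenessDelay {
    static final long MIN_DELAY_MS = 1000; // at most one request per second per host
    static final Map<String, Long> lastRequest = new HashMap<>();

    // Before fetching a URL, wait until MIN_DELAY_MS has elapsed since the
    // previous request to the same host.
    static void waitPolitely(String url) throws InterruptedException {
        String host = URI.create(url).getHost();
        long now = System.currentTimeMillis();
        Long last = lastRequest.get(host);
        if (last != null && now - last < MIN_DELAY_MS) {
            Thread.sleep(MIN_DELAY_MS - (now - last)); // back off for this host
        }
        lastRequest.put(host, System.currentTimeMillis());
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        waitPolitely("http://example.com/a");
        waitPolitely("http://example.com/b"); // same host: sleeps ~1 second
        System.out.println("elapsed >= 1000 ms: "
                + (System.currentTimeMillis() - start >= MIN_DELAY_MS));
    }
}
```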
A Basic crawler in Java

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Main {
    public static void main(String[] args) {
        try {
            // Fetch the seed page and print its HTML line by line
            URL myUrl = new URL("http://www.blogspot.com/");
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(myUrl.openStream()));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
            br.close(); // release the connection
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
XML (Used)
• Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
• Many application programming interfaces (APIs) have been developed to aid software developers with processing XML data, and several schema systems exist to aid in the definition of XML-based languages.
• XML data can be loaded into any form of database.
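A minimal sketch of processing XML with one such API, the JDK's built-in DOM parser; the `<page>` record and its fields are hypothetical, standing in for a harvested-page entry:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XmlSketch {
    // Parse a harvested-page record and pull out one field.
    static String extractUrl(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("url").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String record = "<page><url>http://example.com/</url><rank>0.15</rank></page>";
        System.out.println(extractUrl(record)); // http://example.com/
    }
}
```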
Comparison

Fields          Google       Web Harvest
Time out        Accepted     Eliminated
N Dictionary    Accepted     Eliminated
Dynamic Pages   Doubled      Updated
URL             Relative     Base
Table           Big Table    Bayes Table
Search          Page Rank    Limit Rank
Conclusion
• Web harvesting engine marketing has one of the lowest costs per customer acquisition.
• A web harvesting engine is one of the most cost-efficient ways to reach a target market for a small, medium, or large business.
• Traditional marketing such as catalog mail, trade magazines, direct mail, TV, or radio involves passive participation by your audience, and targeting can vary greatly from one medium to another.
Queries?