Web Mining
-
Upload
mudit-dholakia -
Category
Documents
-
view
16 -
download
1
Transcript of Web Mining
![Page 1: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/1.jpg)
Web MiningBy:-Mudit Dholakia
Guide:-Dr. Amit Ganatra Sir
![Page 2: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/2.jpg)
What is web mining?
• Web mining is the use of the data mining techniques to automatically discover and extract information from web documents/services.• Discovering Knowledge from and about WWW - is one of the basic
abilities of an intelligent agent.
![Page 3: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/3.jpg)
Knowledge
WWW
![Page 4: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/4.jpg)
Web Mining .vs. Data Mining
• Structure (or lack of it)• Textual information and linkage structure
• Scale• Data generated per day is comparable to largest conventional data
warehouses
• Speed• Often need to react to evolving usage patterns in real-time (e.g.,
merchandising)
![Page 5: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/5.jpg)
Web Mining topics
• Web graph analysis• Power Laws and The Long Tail• Structured data extraction• Web advertising • Systems Issues
![Page 6: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/6.jpg)
Size of the Web
• Number of pages• Technically, infinite• Much duplication (30-40%)• Best estimate of “unique” static HTML pages comes from search engine
claims• Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion• Google recently announced that their index contains 1 trillion pages
• How to explain the discrepancy?
![Page 7: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/7.jpg)
The web as a graph
• Pages = nodes, hyperlinks = edges• Ignore content• Directed graph
• High linkage• 10-20 links/page on average• Power-law degree distribution
![Page 8: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/8.jpg)
Structure of Web graph
![Page 9: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/9.jpg)
Power-law degree distribution
![Page 10: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/10.jpg)
Measures
• Structure• In-degrees• Out-degrees• Number of pages per site
• Usage patterns• Number of visitors• Popularity e.g., products, movies, music
![Page 11: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/11.jpg)
The Long Tail
![Page 12: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/12.jpg)
Measures
• Shelf space is a scarce commodity for traditional retailers • Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about products• More choice necessitates better filters
• Recommendation engines (e.g., Amazon)• How Into Thin Air made Touching the Void a bestseller
![Page 13: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/13.jpg)
Searching the Web
Content aggregatorsThe Web Content consumers
![Page 14: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/14.jpg)
Two approaches for analyzing data
• Machine Learning approach• Emphasizes sophisticated algorithms e.g., Support Vector Machines• Data sets tend to be small, fit in memory
• Data Mining approach• Emphasizes big data sets (e.g., in the terabytes)• Data cannot even fit on a single disk!• Necessarily leads to simpler algorithms
![Page 15: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/15.jpg)
View of mining system
Mem
Disk
CPU
Mem
Disk
CPU
Mem
Disk
CPU…
![Page 16: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/16.jpg)
Issues
• Web data sets can be very large • Tens to hundreds of terabytes
• Cannot mine on a single server!• Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets• Without breaking the bank!
![Page 17: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/17.jpg)
What it should do?
• Finding relevant information • Low precision and unindexed information
• Creating new knowledge out of available information on the web• A data-triggered process
• Personalizing the information• Personal preference in content and presentation of the information
• Learning about the consumers • What does the customer want to do?
![Page 18: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/18.jpg)
Direct vs Indirect web mining
• Web mining techniques can be used to solve the information overload problems:
DirectlyAddress the problem with web mining techniques
E.g. newsgroup agent classifies whether the news as relevantIndirectly
Used as part of a bigger application that addresses problemsE.g. used to create index terms for a web search service
![Page 19: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/19.jpg)
Web Mining Categories
• Web Content MiningDiscovering useful information from web page
contents/data/documents.
• Web Structure MiningDiscovering the model underlying link structures (topology)
on the Web. E.g. discovering authorities and hubs
• Web Usage MiningExtraction of interesting knowledge from logging information
produced by web servers.Usage data from logs, user profiles, user sessions, cookies, user
queries, bookmarks, mouse clicks and scrolls, etc.
![Page 20: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/20.jpg)
Types
• Web Mining• Web Content Mining• Web Structure Mining• Web Usage Mining
![Page 21: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/21.jpg)
IRSystem
Query
Documentssource
RankedDocuments
Document
DocumentDocument
ClusteringSystem
Similarity measure
Documentssource
DocDo
cDoc
Doc
Doc
DocDoc
Doc
DocDoc
![Page 22: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/22.jpg)
Web Content Data Structure
• Web content consists of several types of data• Text, image, audio, video, hyperlinks.
• Unstructured – free text• Semi-structured – HTML• More structured – Data in the tables or database generated HTML
pagesNote: much of the Web content data is unstructured text data.
![Page 23: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/23.jpg)
Web Content Mining
• Unstructured DocumentsBag of words to represent unstructured documents
Takes single word as feature Ignores the sequence in which words occur
Features could be Boolean
Word either occurs or does not occur in a document Frequency based
Frequency of the word in a documentVariations of the feature selection include
Removing the case, punctuation, infrequent words and stop wordsFeatures can be reduced using different feature selection techniques:
Information gain, mutual information, cross entropy. Stemming: which reduces words to their morphological roots.
![Page 24: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/24.jpg)
Web Content Mining
• Semi-Structured DocumentsUses richer representations for features
Due to the additional structural information in the hypertext document (typically HTML and hyperlinks)
Uses common data mining methods (whereas unstructured might use more text mining methods)
Application: Hypertext classification or categorization and clustering, learning relations between web documents, learning extraction patterns or rules, and finding patterns in semi-structured data.
![Page 25: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/25.jpg)
Web Content Mining: DB View
• The database techniques on the Web are related to the problems of managing and querying the information on the Web.• DB view tries to infer the structure of a Web site or transform a Web site to
become a database Better information managementBetter querying on the Web
• Can be achieved by:Finding the schema of Web documentsBuilding a Web warehouseBuilding a Web knowledge baseBuilding a virtual database
![Page 26: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/26.jpg)
Web Content Mining: DB View• DB view mainly uses the Object Exchange Model (OEM)
Represents semi-structured data by a labeled graphThe data in the OEM is viewed as a graph, with objects as the vertices
and labels on the edges Each object is identified by an object identifier [oid] and Value is either atomic or complex
• Process typically starts with manual selection of Web sites for doing Web content mining• Main application:
• The task of finding frequent substructures in semi-structured data• The task of creating multi-layered database
![Page 27: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/27.jpg)
![Page 28: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/28.jpg)
Taxonomies
• Ranking• Graph Search• Communities• Hyperlink Induced Topic Search• SEO• Hub & Authorities
![Page 29: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/29.jpg)
Web Structure Mining
• Interested in the structure of the hyperlinks within the Web• Inspired by the study of social networks and citation analysis• Can discover specific types of pages(such as hubs, authorities, etc.) based on
the incoming and outgoing links.
• Application: • Discovering micro-communities in the Web , • measuring the “completeness” of a Web site
![Page 30: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/30.jpg)
Web Usage Mining• Tries to predict user behavior from interaction
with the Web• Wide range of data (logs)
Web client data Proxy server data Web server data
• Two common approaches Maps the usage data of Web server into relational tables before
an adapted data mining techniques Uses the log data directly by utilizing special pre-processing
techniques
![Page 31: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/31.jpg)
Web Usage Mining
Pre-Processing Pattern Discovery Pattern Analysis
User sessionFile Rules and Patterns Interesting
Knowledge
![Page 32: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/32.jpg)
XML View
Generalized Descriptions
More Generalized Descriptions
Layer0
Layer1
Layern
...
![Page 33: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/33.jpg)
33
Use of Multi-Layer Meta Web• Benefits of Multi-Layer Meta-Web: • Multi-dimensional Web info summary analysis• Approximate and intelligent query answering• Web high-level query answering (WebSQL, WebML)• Web content and structure mining• Observing the dynamics/evolution of the Web
• Is it realistic to construct such a meta-Web?• Benefits even if it is partially constructed• Benefits may justify the cost of tool development,
standardization and partial restructuring
![Page 34: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/34.jpg)
Web Search Products and ServicesAlta VistaDB2 text extenderExciteFulcrumGlimpse (Academic)Google! Inforseek Internet Inforseek Intranet Inktomi (HotBot) Lycos
PLSSmart (Academic)Oracle text extender Verity Yahoo!
![Page 35: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/35.jpg)
Web Usage Mining
• Typical problems: • Distinguishing among unique users, server sessions,
episodes, etc. in the presence of caching and proxy servers
• Often Usage Mining uses some background or domain knowledge
E.g. site topology, Web content, etc.
![Page 36: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/36.jpg)
Web Usage Mining
• Applications:• Two main categories:
Learning a user profile (personalized)Web users would be interested in techniques that learn their needs and preferences automatically
Learning user navigation patterns (impersonalized)Information providers would be interested in techniques that
improve the effectiveness of their Web site
![Page 37: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/37.jpg)
References
• www.cs.jyu.fi/ai/vagan/Web_Mining.ppt• www.infolab.stanford.edu/~ullman/mining/webMiningOverview.ppt• www.psl.cs.columbia.edu/classes/.../Presentation_Jagriti_Mishra.ppt
x
![Page 38: Web Mining](https://reader034.fdocuments.in/reader034/viewer/2022042701/55cb430cbb61ebad668b4661/html5/thumbnails/38.jpg)
Thank You