Web Mining (B.Tech, 2015)
Web Specifics

Web: a huge, widely distributed, highly heterogeneous, semi-structured, interconnected hypertext/hypermedia information repository.
The Web is a huge collection of documents, plus:
- Hyperlink information
- Access and usage information
Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. (Etzioni, CACM)
A Few Themes in Web Mining

Some interesting problems in Web mining:
- Mining what a Web search engine finds
- Identification of authoritative Web pages
- Identification of Web communities
- Web document classification
- Warehousing a meta-Web: a Web yellow-page service
- Web log mining (usage, access, and evolution)
- Intelligent query answering in Web search
WEB DATA

- Web page content: the content of actual Web pages.
- Intra-page structure: the HTML or XML code of a Web page (e.g., anchors).
- Inter-page structure: the linkage structure between Web pages.
WEB DATA (continued)

- Usage data: how Web pages are accessed by end users.
- User profile: demographic and registration information obtained about the user, and information found in cookies.
WEB MINING TAXONOMY
Overview of Web Mining: the taxonomy of Web mining

- Web content mining: summarization of Web page contents (e.g., WebML queries).
- Web structure mining: summarization of search-engine results (e.g., PageRank).
- Web usage mining: mining user browsing and access patterns, to understand users' behavior on the Web (e.g., Web log mining).
Web Content Mining

"Web content mining is the process of extracting useful information from the contents of Web documents. The content may consist of text, images, audio, video, or structured records such as lists and tables."
"Web content mining examines the content of Web pages as well as the results of Web search." The first is the traditional searching of Web pages via their content; the second is a further search of the pages found by a previous search.
Web Content Data Structure

Web content consists of several types of data: text, images, audio, video, and hyperlinks.
- Unstructured: free text
- Semi-structured: HTML
- More structured: data in tables or database-generated HTML pages
Note: much of the Web's content is unstructured text data.
Text Characteristics

- Noisy data, e.g., spelling mistakes.
- Text that is not well structured, e.g., chat rooms:
  "r u available ?"
  "Hey whazzzzzz up"
  "Yup"
Information Retrieval (IR)

Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents.
- Indexing and retrieval of textual documents.
- Web search engines are the most visible IR applications; others include online library catalogs.
Information Retrieval

Given:
- A source of textual documents
- A user query (text based)
Find:
- A set of documents, ranked by relevance to the query

[Diagram: a documents source and a text query feed into an IR system, which outputs ranked documents.]
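The "find a ranked set of documents" step can be sketched with simple TF-IDF scoring. The toy corpus and the particular weighting/normalization choices below are illustrative assumptions, not how any specific engine works:

```python
import math
from collections import Counter

def tfidf_rank(docs, query):
    """Rank documents against a query by TF-IDF weighted overlap."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # document frequency of each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    qv = vec(query.lower().split())
    scores = []
    for i, doc in enumerate(tokenized):
        dv = vec(doc)
        dot = sum(qv[t] * dv.get(t, 0.0) for t in qv)
        norm = math.sqrt(sum(w * w for w in dv.values())) or 1.0
        scores.append((dot / norm, i))
    return sorted(scores, reverse=True)  # best-scoring documents first

docs = ["web mining extracts information from the web",
        "data mining techniques",
        "cooking recipes for dinner"]
print(tfidf_rank(docs, "web mining"))
```

Running this ranks document 0 first for the query "web mining", since it matches both query terms and "web" is rarer in the corpus than "mining".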
Intelligent Information Retrieval

- Meaning of words: synonyms ("buy" / "purchase"); ambiguity ("bat": baseball vs. cricket).
- Order of words in the query.
- User dependency of the data: direct feedback, indirect feedback.
- Authority of the source: IBM is more likely to be an authoritative source than my second cousin.
Information Extraction (IE)

Information extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human-language texts by means of natural language processing (NLP). The goal is to allow logical reasoning to draw inferences based on the logical content of the input data.
Subtasks of IE

1. Named entity extraction, which includes:
   a) Named entity recognition: recognition of known entity names (for people and organizations), e.g., PERSON located-in LOCATION (extracted from the sentence "Amit is in Delhi.").
   b) Relationship extraction: identification of relations between entities, e.g., PERSON works-for ORGANIZATION (extracted from the sentence "Amit works for IBM.").
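These two subtasks can be caricatured with hand-written patterns. The regexes and the capitalized-word heuristic for entities below are toy assumptions, not a real NLP pipeline:

```python
import re

# Two hand-written extraction patterns standing in for the IE subtasks
# above; capitalized words are crudely assumed to be entity names.
PATTERNS = [
    (re.compile(r"\b([A-Z][a-z]+) works for ([A-Z][A-Za-z]+)"), "works_for"),
    (re.compile(r"\b([A-Z][a-z]+) is in ([A-Z][a-z]+)"), "located_in"),
]

def extract_relations(text):
    """Return (entity1, relation, entity2) triples found in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for e1, e2 in pattern.findall(text):
            triples.append((e1, relation, e2))
    return triples

print(extract_relations("Amit works for IBM. Amit is in Delhi."))
# [('Amit', 'works_for', 'IBM'), ('Amit', 'located_in', 'Delhi')]
```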
2. Semi-structured information extraction:
   - Table extraction: finding and extracting tables from documents.
   - Comment extraction: extracting comments from the actual content of an article.
3. Terminology extraction: finding the relevant terms for a given corpus.
Information Extraction

Given:
- A source of textual documents
- A well-defined, limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output it in a predetermined format
What is Information Extraction?

[Diagram: a documents source and queries (e.g., Query 1: job title, Query 2: salary) feed into an extraction system, which outputs relevant pieces of information (Relevant Info 1, 2, 3) that are then combined into query results.]
Mining the Web

[Diagram: a Web spider feeds a documents source; a query goes into an IR/IE system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Text Mining Process

- Text preprocessing: syntactic/semantic text analysis.
- Feature generation: bag of words.
- Feature selection: simple counting, statistics.
- Text/data mining: classification (supervised learning), clustering (unsupervised learning).
- Analyzing results.
Text Preprocessing

Text cleanup:
- e.g., remove ads from Web pages, normalize text converted from binary formats, deal with tables, figures, and formulas.
Tokenization:
- Splitting a string of characters into a set of tokens.
- Issues to deal with:
  - Apostrophes: is "John's sick" one token or two?
  - Hyphens: database vs. data-base vs. data base.
  - How should we deal with "C++", "A/C", ":-)", "..."?
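A toy tokenizer illustrating these choices: it keeps apostrophe contractions together and treats a few special strings as single tokens rather than splitting on every non-letter character. The list of special tokens is an illustrative assumption:

```python
import re

def tokenize(text):
    """Split text into tokens, keeping contractions ("John's") and a few
    hand-picked special tokens ("C++", "A/C", ":-)", "...") intact."""
    special = [r"C\+\+", r"A/C", r":-\)", r"\.\.\."]
    # Special tokens are tried first, then words with an optional
    # apostrophe suffix, then runs of digits.
    pattern = "|".join(special) + r"|[A-Za-z]+(?:'[A-Za-z]+)?|\d+"
    return re.findall(pattern, text)

print(tokenize("John's sick; he wrote C++ all day :-)"))
# ["John's", 'sick', 'he', 'wrote', 'C++', 'all', 'day', ':-)']
```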
Syntactic / Semantic Text Analysis

- Part-of-speech (POS) tagging: find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun). About 98% accurate.
- Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph.
Feature Generation: Bag of Words

- A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}.
- Highly efficient; makes learning far simpler and easier.
- The order of words is not that important for certain applications.
- Stemming: identifies a word by its root (e.g., flying, flew -> fly); reduces dimensionality.
- Stop words: the most common words are unlikely to help text mining, e.g., "the", "a", "an", "you", ...
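A minimal bag-of-words feature generator combining the three ideas above (counting, stemming, stop-word removal). The suffix-stripping stemmer is a deliberately crude stand-in for a real one such as Porter's, and the stop-word list is truncated for illustration:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "you", "and"}

def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, drop stop words, stem, and count occurrences."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(crude_stem(w) for w in tokens)

print(bag_of_words("the Lord of the rings"))
# Counter({'lord': 1, 'ring': 1})
```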
Web Search Basics

[Diagram: the user issues a search query. A Web crawler traverses the Web and feeds an indexer, which builds the indexes used to answer searches. The results page mixes algorithmic Web results (e.g., "Results 1 - 10 of about 7,310,000 for miele (0.12 seconds)", listing miele.com, miele.co.uk, miele.de, miele.at) with sponsored links drawn from separate ad indexes.]
Crawlers

- A robot (spider) is a program that traverses the hypertext structure of the Web and collects information from visited pages.
- The page (or pages) that the crawler starts with are referred to as the seed URLs.
- Crawlers are used to construct indexes for search engines.
- Periodic crawler: visits a certain number of pages and then stops, builds an index, and replaces the existing index. It is activated periodically and searches portions of the Web (e.g., notices and circulars).
- Incremental crawler: selectively searches the Web and only updates the index incrementally, as opposed to replacing it.
- Focused crawler: visits pages related to a particular subject (topic of interest).
Focused Crawler

- Only visits links from a page if that page is determined to be relevant.
- A major piece of the architecture is a hypertext classifier that associates a relevance score with each document with respect to the crawl topic.
- A distiller is used to determine which pages contain links to many relevant pages; these are called hub pages.
- Hub pages may not contain the relevant information themselves, but they are quite important for facilitating continued searching.
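The crawl loop can be sketched as a best-first search over a priority frontier. The `fetch`, `links`, and `relevance` functions are caller-supplied placeholders (in a real focused crawler the relevance function would be the hypertext classifier described above), and the scoring policy here is a simplification:

```python
import heapq

def focused_crawl(seed_urls, fetch, links, relevance, budget=100, threshold=0.5):
    """Best-first focused-crawl sketch.

    fetch(url) -> page, links(page) -> outgoing URLs,
    relevance(page) -> score in [0, 1] w.r.t. the crawl topic.
    Only pages judged relevant have their outgoing links expanded."""
    # Max-heap via negated scores; seeds start with perfect relevance.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        score = relevance(page)
        if score >= threshold:
            collected.append(url)
            # Expand links only from relevant pages, prioritized by score.
            for out in links(page):
                if out not in visited:
                    heapq.heappush(frontier, (-score, out))
    return collected
```

On a toy three-page graph where pages "a" and "c" match the topic and "b" does not, the crawl collects "a" and "c" and never expands the irrelevant page "b".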
Intelligent Search Agents

- Locating documents and services on the Web: WebCrawler and AltaVista (http://www.altavista.com) scan millions of Web documents and create an index of words.
- Web robots: software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information.
- ShopBot: retrieves product information from a variety of vendor sites using only general information about the product domain.
Web Outlier Mining

Web outliers are data objects that show significantly different characteristics from other Web data. Web outlier mining is the discovery and analysis of rare and interesting patterns from the Web.
Web Structure Mining

- Interested in the structure of the hyperlinks within the Web.
- Inspired by the study of social networks and citation analysis.
- Can discover specific types of pages (such as hubs and authorities) based on their incoming and outgoing links.
- Applications: discovering micro-communities in the Web; measuring the "completeness" of a Web site.

There are two approaches:
- PageRank: for discovering the most important pages on the Web (as used in Google).
- Hubs and authorities: a more detailed evaluation of the importance of Web pages.

The basic definition of importance: a page is important if important pages link to it.
- (1970s) Researchers proposed methods of using citations among journal articles to evaluate the quality of research papers.
- Unlike journal citations, Web linkage has some unique features:
  - One authority page will seldom have its Web page point to its competing authorities (Coca-Cola -> Pepsi).
  - Authoritative pages are seldom self-descriptive (Yahoo! may not contain the description "Web search engine").
Predecessors and Successors of a Web Page

[Diagram: a page's predecessors are the pages that link to it; its successors are the pages it links to.]
Link Structure Analysis

- Improves information retrieval by scoring Web pages according to their importance in the Web, or in a thematic sub-domain of it.
- Nodes with large fan-in (authorities) provide high-quality information.
- Nodes with large fan-out (hubs) are good starting points.
PageRank

A simple solution: create a stochastic matrix M of the Web:
- Each page i corresponds to row i and column i of the matrix.
- If page j has n successors (links), then entry M[i][j] is 1/n if page i is one of those n successors of page j, and 0 otherwise.
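Under that definition, the matrix can be built directly from an adjacency list. The link structure below is chosen to match the three-page worked example in these slides:

```python
def stochastic_matrix(links, pages):
    """Build the column-stochastic Web matrix: column j distributes
    page j's importance equally among its successors."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    M = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        succ = links[page]
        for s in succ:
            M[index[s]][j] = 1.0 / len(succ)
    return M

# Link structure consistent with the example matrix used later:
# A -> {A, B}, B -> {A, C}, C -> {B}
links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
M = stochastic_matrix(links, ["A", "B", "C"])
print(M)
# [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]
```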
The intuition behind this matrix:
- Initially, each page has 1 unit of importance.
- At each round, each page shares the importance it has among its successors, and receives new importance from its predecessors.
- The importance of each page reaches a limit after some steps.
PageRank: Example

Assume that the Web consists of only three pages: A, B, and C. The links among these pages give the transition matrix below (column X shows how page X's importance is split among its successors). Let [a, b, c] be the vector of importances for these three pages.

      A    B    C
A    1/2  1/2   0
B    1/2   0    1
C     0   1/2   0
The equation describing the asymptotic values of these three variables is:

[a]   [1/2  1/2   0] [a]
[b] = [1/2   0    1] [b]
[c]   [ 0   1/2   0] [c]

We can solve equations like this one by starting with the assumption a = b = c = 1 and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates:

a = 1    1    5/4   9/8    5/4  ... 6/5
b = 1   3/2    1   11/8  17/16  ... 6/5
c = 1   1/2   3/4   1/2  11/16  ... 3/5
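The repeated application of the matrix is a power iteration. A minimal sketch that reproduces the estimates above:

```python
def power_iterate(M, steps=100):
    """Repeatedly apply the transition matrix to the importance vector,
    starting from all-ones, as in the worked example above."""
    n = len(M)
    v = [1.0] * n
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
print(power_iterate(M))  # approaches [6/5, 6/5, 3/5]
```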
In the limit, the solution is a = b = 6/5 and c = 3/5. That is, a and b each have the same importance, twice that of c.

Problems with Real Web Graphs

- Dead ends: a page that has no successors has nowhere to send its importance. Pages with no outlinks are "dead ends" for the random surfer: there is nowhere to go on the next step.
Dead Ends: PageRank

Assume now that the structure of the Web has changed: C no longer links to any page, so the new matrix describing transitions is:

[a]   [1/2  1/2  0] [a]
[b] = [1/2   0   0] [b]
[c]   [ 0   1/2  0] [c]

The first four steps of the iterative solution are:
a = 1    1    3/4   5/8   1/2
b = 1   1/2   1/2   3/8  5/16
c = 1   1/2   1/4   1/4  3/16
Eventually, a, b, and c all become 0.
Spider Traps

- A group of pages is a spider trap if there are no links from within the group to outside the group.
- Spider traps: a group of one or more pages that have no links out. The random surfer gets trapped, and the group accumulates all the importance.
Spider Traps: Example

Assume now once more that the structure of the Web has changed: C links only to itself, so the new matrix describing transitions is:

[a]   [1/2  1/2  0] [a]
[b] = [1/2   0   0] [b]
[c]   [ 0   1/2  1] [c]

The first four steps of the iterative solution are:
a = 1    1    3/4   5/8    1/2
b = 1   1/2   1/2   3/8   5/16
c = 1   3/2   7/4    2   35/16
c converges to 3, while a and b converge to 0.
Google's Solution

Instead of applying the matrix directly, "tax" each page some fraction of its current importance, and distribute the taxed importance equally among all pages. For example, with a 20% tax (so each of the 3 pages receives 0.2 units of the taxed importance per round), the equations of the previous example become:

a = 0.8 * (1/2*a + 1/2*b + 0*c) + 0.2
b = 0.8 * (1/2*a + 0*b  + 0*c) + 0.2
c = 0.8 * (0*a  + 1/2*b + 1*c) + 0.2

The solution to these equations is a = 7/11, b = 5/11, c = 21/11.
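The taxed iteration can be sketched as follows; applied to the spider-trap matrix, it reproduces the 7/11, 5/11, 21/11 solution instead of letting C absorb everything:

```python
def taxed_pagerank(M, tax=0.2, steps=100):
    """PageRank with the 'taxation' trick: keep (1 - tax) of the
    matrix-distributed importance and hand every page an equal share
    of the taxed importance. The share is fixed at tax * n / n = tax
    units per page, matching the +0.2 terms in the equations above."""
    n = len(M)
    v = [1.0] * n
    share = tax * sum(v) / n  # 0.2 for the 3-page example
    for _ in range(steps):
        v = [(1 - tax) * sum(M[i][j] * v[j] for j in range(n)) + share
             for i in range(n)]
    return v

# The spider-trap example: C links only to itself.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
print(taxed_pagerank(M))  # approaches [7/11, 5/11, 21/11]
```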
HITS (Hypertext-Induced Topic Selection)

Kleinberg's hypertext-induced topic selection (HITS) algorithm was also developed for ranking documents based on the link information among a set of documents.

Authorities and Hubs

The algorithm identifies two types of pages:
- Authority: a page that provides important, trustworthy information on a given topic.
- Hub: a page that contains links to authorities.
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
Authorities and Hubs (2)

[Diagram: pages 2, 3, and 4 point to page 1, so page 1's authority weight sums their hub weights; page 1 points to pages 5, 6, and 7, so its hub weight sums their authority weights.]

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS Algorithm

- Hubs point to lots of authorities.
- Authorities are pointed to by lots of hubs.
- Together they form a bipartite graph: hubs on one side, authorities on the other.
Example of Hubs and Authorities

The adjacency matrix A of a set of pages (nodes) defines the linking structure: element a_ij is 1 if node i references node j, and 0 otherwise.

    [0 0 1]          [0 0 0]
A = [0 0 1]    A^T = [0 0 0]
    [0 0 0]          [1 1 0]
Hubs and Authorities

- The weights for authority nodes can be determined from the hub weights: a = A^T . h
- And similarly the hub weights from the authority weights: h = A . a
- Substituting each equation into the other, we get:
  a = A^T . A . a
  h = A . A^T . h
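The update equations above can be iterated directly. A minimal sketch, with the standard normalization step added so the weights stay bounded (the slides leave normalization implicit):

```python
def hits(adj, steps=20):
    """Iterate a = A^T . h and h = A . a over an adjacency matrix,
    normalizing each round so the largest weight is 1."""
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(steps):
        # a = A^T . h : page j's authority sums hub weights of pages linking to it
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # h = A . a : page i's hub weight sums authority weights of pages it links to
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        a_max = max(auth) or 1.0
        h_max = max(hub) or 1.0
        auth = [x / a_max for x in auth]
        hub = [x / h_max for x in hub]
    return hub, auth

# The three-node example above: nodes 1 and 2 both link to node 3.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
hub, auth = hits(A)
print(hub, auth)  # nodes 1 and 2 are pure hubs; node 3 is the authority
```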
Example (continued)

Assume the initial hub weight vector is h = [1, 1, 1]^T. We compute the authority weight vector by:

              [0 0 0]   [1]   [0]
a = A^T . h = [0 0 0] . [1] = [0]
              [1 1 0]   [1]   [2]

The hub weights are then updated with the help of h = A . a:

            [0 0 1]   [0]   [2]
h = A . a = [0 0 1] . [0] = [2]
            [0 0 0]   [2]   [0]

Once we have the hub weights h and authority weights a, we keep updating them with:
a = A^T . A . a
h = A . A^T . h
HITS vs. PageRank

- HITS emphasizes the mutual reinforcement between authority and hub pages, while PageRank does not attempt to capture the distinction between hubs and authorities: it ranks pages by authority alone.
- HITS is applied to the local neighborhood of pages surrounding the results of a query, whereas PageRank is applied to the entire Web.
- HITS is query-dependent, but PageRank is query-independent.
- Both HITS and PageRank correspond to matrix computations.
- Both can be unstable: changing a few links can lead to quite different rankings.
- PageRank doesn't handle pages with no out-edges very well, because they decrease the overall PageRank.
Web Usage Mining

- A Web site is a collection of inter-related files on one or more Web servers.
- Web usage mining: the discovery of meaningful patterns from data generated by client-server transactions.
- Typical sources of data:
  - Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies.
  - User profiles.
- Tries to predict user behavior from interaction with the Web.
- Uses a wide range of data (logs): Web client data, proxy server data, Web server data.
- Two common approaches:
  - Map the usage data of the Web server into relational tables before applying adapted data mining techniques.
  - Use the log data directly, with special pre-processing techniques.

Data Sources: Logs

Every time a user visits the Web site, the server records the click flow (clickstream) in the log file.
Sample Web Log

2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 GET /STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 HTTP/1.0 www.mcu.ac.in Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - http://www.mcu.ac.in/search_a_study_institute.htm

Field by field:
- date (2006-10-04), time (04:24:39)
- client-side IP (203.200.95.194)
- server site name (W3SVC195), server computer name (NS), server IP (69.41.233.13), server port (80)
- method (GET), client URL (/STUDYCENTRE/StudyCentreSearchPage.asp)
- server status code (200), client-side bytes received (568), time taken (210)
- user agent (Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;))
- cs_referer (http://www.mcu.ac.in/search_a_study_institute.htm)
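A sketch of parsing such a line by splitting on whitespace. The field positions are read off this particular sample entry and would need adjusting for a server configured with different log fields:

```python
def parse_log_line(line):
    """Parse one space-delimited extended-log entry into a dict.
    Field positions match the sample entry above (an assumption)."""
    f = line.split()
    return {
        "date": f[0], "time": f[1], "client_ip": f[2],
        "site": f[4], "server": f[5], "server_ip": f[6], "port": f[7],
        "method": f[8], "url": f[9],
        "status": int(f[11]), "bytes_received": int(f[14]),
        "time_taken_ms": int(f[15]),
        "user_agent": f[18], "referer": f[20],
    }

sample = ("2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 "
          "GET /STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 "
          "HTTP/1.0 www.mcu.ac.in "
          "Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - "
          "http://www.mcu.ac.in/search_a_study_institute.htm")
rec = parse_log_line(sample)
print(rec["client_ip"], rec["status"], rec["time_taken_ms"])
# 203.200.95.194 200 210
```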
Typical problems:
- Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers.
- Usage mining often uses background or domain knowledge, e.g., site topology, Web content, etc.
Web Usage Mining: Applications

Two main categories:
- Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically.
- Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web sites.

Applications include:
- Personalization
- Improving the structure of a site's Web pages
- Aiding caching and prediction of future page references
- Improving the design of individual pages
- Improving the effectiveness of e-commerce (sales and advertising)
Web Usage Mining (WUM)

The discovery of interesting user access patterns from Web server logs.

Generate simple statistical reports:
- A summary report of hits and bytes transferred
- A list of top requested URLs
- A list of top referrers
- A list of the most common browsers used
- Hits per hour/day/week/month reports
- Hits per domain reports

Learn:
- Who is visiting your site
- The path visitors take through your pages
- How much time visitors spend on each page
- The most common starting page
- Where visitors are leaving your site
Web Usage Mining: Three Phases

Raw server log -> [Pre-processing] -> user session file -> [Pattern discovery] -> rules and patterns -> [Pattern analysis] -> interesting knowledge
Web Log Mining

- The server log is filtered (parsed).
- Data cleaning routines attempt to deal with missing values, smooth out noise, and identify outliers.
- Data transformation: into the forms most appropriate for Web usage mining.
- Web mining techniques such as association rules, clustering, and classification are then applied.
Summary Reports

- Request summary: request statistics for all modules/pages/files.
- Domain summary: request statistics from different domains.
- Event summary: statistics of the occurrence of all events/actions.
- Session summary: statistics of sessions.
- Bandwidth summary: statistics of generated network traffic.
- Error summary: statistics of all error messages.
- Referring organization summary: statistics of where the users came from.
- Agent summary: statistics of the use of different browsers, etc.
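Two of these reports (top requested URLs and a status-code breakdown) reduce to simple counting over parsed log records. The record shape below, a dict with "url" and "status" keys, is an assumption matching the sample log fields:

```python
from collections import Counter

def summarize(records):
    """Produce a top-URLs report and a hits-per-status report from
    parsed log records (dicts with 'url' and 'status' keys)."""
    urls = Counter(r["url"] for r in records)
    status = Counter(r["status"] for r in records)
    return urls.most_common(5), status

recs = [{"url": "/a", "status": 200}, {"url": "/a", "status": 200},
        {"url": "/b", "status": 404}]
print(summarize(recs))
# ([('/a', 2), ('/b', 1)], Counter({200: 2, 404: 1}))
```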
Personalized Information Access

[Diagram: information sources pass through a personalization server to the receivers.]

Personalization vs. Intelligence

- Better service for the user:
  - Reduction of information overload.
  - More accurate information retrieval and extraction.
  - Recommendation and guidance.
- Customer relationship management:
  - Customer segmentation and targeted advertisement.
  - Customer attraction and retention.
  - Service improvement (site structure and content).
User Modeling

Basic elements:
- Constructing models that can be used to adapt the system to the user's requirements.
- Different types of requirement: interests (sports and finance news), knowledge level (novice vs. expert), preferences (no-frame GUI), etc.
- Different types of model: personal vs. generic.
- Knowledge discovery facilitates the acquisition of user models from data.

Examples of user models:
- User model (type A) [PERSONAL]: User x -> sports, stock market
- User model (type B) [PERSONAL]: User x, Age 26, Male -> sports, stock market
- User community [GENERIC]: Users {x, y, z} -> sports, stock market
- User stereotype [GENERIC]: Users {x, y, z}, Age [20..30], Male -> sports, stock market
Learning User Models

[Diagram: observation of users 1-5 interacting with the system yields user models, which are grouped into user communities (Community 1, Community 2).]

Knowledge discovery process:
- Data collection: collection of usage data by the server and the client.
- Data pre-processing: data cleaning, user identification, session identification.
- Pattern discovery: construction of user models.
- Knowledge post-processing: report generation, visualization, personalization module.
Usage Data Sources

- Server-side data: access logs in Common or Extended Log Format, user queries, cookies.
- Client-side data: Java and JavaScript agents.
- Intermediary data: proxy logs, packet sniffers.
- Registration forms: personal information and preferences supplied by the user.
- Demographic information: provided by census databases.
Pre-processing Usage Data

- Cleaning:
  - Log entries that correspond to error responses.
  - Trails of robots.
  - Pages that have not been requested explicitly by the user (mainly image files, loaded automatically). This step should be domain-specific.
- User identification:
  - Identification by log-in.
  - Cookies and JavaScript.
  - Extended Log Format (browser and OS version).
  - Bookmarking a user-specific URL.
- User session/transaction identification in log files:
  - Time-based methods, e.g., a 30-minute silence interval. There are problems with caching; partial solutions include special HTTP headers and Java agents.
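The 30-minute-silence heuristic can be sketched as follows, assuming one user's requests are already sorted by timestamp:

```python
from datetime import datetime, timedelta

def sessionize(requests, gap_minutes=30):
    """Split one user's time-sorted (timestamp, url) requests into
    sessions: a silence longer than gap_minutes starts a new session."""
    gap = timedelta(minutes=gap_minutes)
    sessions = []
    for ts, url in requests:
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, url))  # continue current session
        else:
            sessions.append([(ts, url)])    # start a new session
    return sessions

reqs = [(datetime(2006, 10, 4, 4, 24), "/index.htm"),
        (datetime(2006, 10, 4, 4, 30), "/search.asp"),
        (datetime(2006, 10, 4, 6, 0), "/index.htm")]
print(len(sessionize(reqs)))  # 2: the 6:00 request starts a new session
```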
Collection and Preparation

Problems:
- Privacy and security issues: the user must be aware of the data collected; cookies and client-side agents are often disabled.
- Caching on the client or an intermediate proxy causes data loss on the server side.
- Registration forms are a nuisance and are not reliable sources.
- User and session identification from server logs is hard.
- Different data are required for different user models.
Personalized Assistants

Personalized crawling [Lieberman et al., ACM Communications, 2000]:
- The system knows the user (log-in).
- It uses heuristics to extract "important" terms from the Web pages that the user visits and adds them to thematic profiles.
- Each time the user views a page, the system:
  - searches the Web for related pages,
  - filters them according to the relevant thematic profile,
  - and constructs a list of recommended links for the user.
- The Letizia version of the system searches the Web locally, following outgoing links from the current page.
- The Powerscout version uses a search engine to explore the Web.