Web Mining (B.Tech, 2015)
Web Specifics

Web: a huge, widely distributed, highly heterogeneous, semi-structured, interconnected hypertext/hypermedia information repository.
The Web is a huge collection of documents, plus:
- Hyperlink information
- Access and usage information
Web Mining

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services. (Etzioni, CACM)
A Few Themes in Web Mining

Some interesting problems in Web mining:
- Mining what a Web search engine finds
- Identification of authoritative Web pages
- Identification of Web communities
- Web document classification
- Warehousing a meta-Web: a Web yellow-page service
- Web log mining (usage, access, and evolution)
- Intelligent query answering in Web search
WEB DATA

- Web page content: the content of actual Web pages.
- Intra-page structure: the HTML or XML code of a Web page (e.g., anchors).
- Inter-page structure: the linkage structure between Web pages.
WEB DATA (continued)

- Usage data: how Web pages are accessed by end users.
- User profile: demographic and registration information obtained about the user, and information found in cookies.
WEB MINING TAXONOMY
Overview of Web Mining: the taxonomy of Web mining

- Web content mining: summarization of Web page contents (e.g., WebML queries).
- Web structure mining: summarization of search-engine results (e.g., PageRank).
- Web usage mining: mining user browsing and access patterns, to understand users' behavior on the Web (e.g., Web log mining).
Web Content Mining

"Web content mining is the process of extracting useful information from the contents of Web documents. The content may consist of text, images, audio, video, or structured records such as lists and tables."
"Web content mining examines the content of Web pages as well as the results of Web search." The first is the traditional searching of Web pages via their content; the second is a further search of the pages found by a previous search.
Web Content Data Structure

Web content consists of several types of data: text, images, audio, video, and hyperlinks.
- Unstructured: free text
- Semi-structured: HTML
- More structured: data in tables or database-generated HTML pages
Note: much of the Web's content is unstructured text data.
Text Characteristics

- Noisy data, e.g., spelling mistakes.
- Text that is not well structured, e.g., chat rooms:
  "r u available ?"
  "Hey whazzzzzz up"
  "Yup"
Information Retrieval (IR)

Information retrieval is the area of study concerned with searching for documents, for information within documents, and for metadata about documents.
- Indexing and retrieval of textual documents.
- Web search engines are the most visible IR applications; others include online library catalogs.
Information Retrieval

Given:
- A source of textual documents
- A user query (text based)
Find:
- A set of documents, ranked by relevance to the query

[Diagram: a documents source and a text query feed into an IR system, which outputs ranked documents.]
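The "find a ranked set of documents" step can be sketched with simple TF-IDF scoring. The toy corpus and the particular weighting/normalization choices below are illustrative assumptions, not how any specific engine works:

```python
import math
from collections import Counter

def tfidf_rank(docs, query):
    """Rank documents against a query by TF-IDF weighted overlap."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # document frequency of each term
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    qv = vec(query.lower().split())
    scores = []
    for i, doc in enumerate(tokenized):
        dv = vec(doc)
        dot = sum(qv[t] * dv.get(t, 0.0) for t in qv)
        norm = math.sqrt(sum(w * w for w in dv.values())) or 1.0
        scores.append((dot / norm, i))
    return sorted(scores, reverse=True)  # best-scoring documents first

docs = ["web mining extracts information from the web",
        "data mining techniques",
        "cooking recipes for dinner"]
print(tfidf_rank(docs, "web mining"))
```

Running this ranks document 0 first for the query "web mining", since it matches both query terms and "web" is rarer in the corpus than "mining".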
Intelligent Information Retrieval

- Meaning of words: synonyms ("buy" / "purchase"); ambiguity ("bat": baseball vs. cricket).
- Order of words in the query.
- User dependency of the data: direct feedback, indirect feedback.
- Authority of the source: IBM is more likely to be an authoritative source than my second cousin.
Information Extraction (IE)

Information extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human-language texts by means of natural language processing (NLP). The goal is to allow logical reasoning to draw inferences based on the logical content of the input data.
Subtasks of IE

1. Named entity extraction, which includes:
   a) Named entity recognition: recognition of known entity names (for people and organizations), e.g., PERSON located-in LOCATION (extracted from the sentence "Amit is in Delhi.").
   b) Relationship extraction: identification of relations between entities, e.g., PERSON works-for ORGANIZATION (extracted from the sentence "Amit works for IBM.").
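These two subtasks can be caricatured with hand-written patterns. The regexes and the capitalized-word heuristic for entities below are toy assumptions, not a real NLP pipeline:

```python
import re

# Two hand-written extraction patterns standing in for the IE subtasks
# above; capitalized words are crudely assumed to be entity names.
PATTERNS = [
    (re.compile(r"\b([A-Z][a-z]+) works for ([A-Z][A-Za-z]+)"), "works_for"),
    (re.compile(r"\b([A-Z][a-z]+) is in ([A-Z][a-z]+)"), "located_in"),
]

def extract_relations(text):
    """Return (entity1, relation, entity2) triples found in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for e1, e2 in pattern.findall(text):
            triples.append((e1, relation, e2))
    return triples

print(extract_relations("Amit works for IBM. Amit is in Delhi."))
# [('Amit', 'works_for', 'IBM'), ('Amit', 'located_in', 'Delhi')]
```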
2. Semi-structured information extraction:
   - Table extraction: finding and extracting tables from documents.
   - Comment extraction: extracting comments from the actual content of an article.
3. Terminology extraction: finding the relevant terms for a given corpus.
Information Extraction

Given:
- A source of textual documents
- A well-defined, limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output it in a predetermined format
What is Information Extraction?

[Diagram: a documents source and queries (e.g., Query 1: job title, Query 2: salary) feed into an extraction system, which outputs relevant pieces of information (Relevant Info 1, 2, 3) that are then combined into query results.]
Mining the Web

[Diagram: a Web spider feeds a documents source; a query goes into an IR/IE system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Text Mining Process

- Text preprocessing: syntactic/semantic text analysis.
- Feature generation: bag of words.
- Feature selection: simple counting, statistics.
- Text/data mining: classification (supervised learning), clustering (unsupervised learning).
- Analyzing results.
Text Preprocessing

Text cleanup:
- e.g., remove ads from Web pages, normalize text converted from binary formats, deal with tables, figures, and formulas.
Tokenization:
- Splitting a string of characters into a set of tokens.
- Issues to deal with:
  - Apostrophes: is "John's sick" one token or two?
  - Hyphens: database vs. data-base vs. data base.
  - How should we deal with "C++", "A/C", ":-)", "..."?
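A toy tokenizer illustrating these choices: it keeps apostrophe contractions together and treats a few special strings as single tokens rather than splitting on every non-letter character. The list of special tokens is an illustrative assumption:

```python
import re

def tokenize(text):
    """Split text into tokens, keeping contractions ("John's") and a few
    hand-picked special tokens ("C++", "A/C", ":-)", "...") intact."""
    special = [r"C\+\+", r"A/C", r":-\)", r"\.\.\."]
    # Special tokens are tried first, then words with an optional
    # apostrophe suffix, then runs of digits.
    pattern = "|".join(special) + r"|[A-Za-z]+(?:'[A-Za-z]+)?|\d+"
    return re.findall(pattern, text)

print(tokenize("John's sick; he wrote C++ all day :-)"))
# ["John's", 'sick', 'he', 'wrote', 'C++', 'all', 'day', ':-)']
```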
Syntactic / Semantic Text Analysis

- Part-of-speech (POS) tagging: find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun). About 98% accurate.
- Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph.
Feature Generation: Bag of Words

- A text document is represented by the words it contains (and their occurrences), e.g., "Lord of the rings" -> {"the", "Lord", "rings", "of"}.
- Highly efficient; makes learning far simpler and easier.
- The order of words is not that important for certain applications.
- Stemming: identifies a word by its root (e.g., flying, flew -> fly); reduces dimensionality.
- Stop words: the most common words are unlikely to help text mining, e.g., "the", "a", "an", "you", ...
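A minimal bag-of-words feature generator combining the three ideas above (counting, stemming, stop-word removal). The suffix-stripping stemmer is a deliberately crude stand-in for a real one such as Porter's, and the stop-word list is truncated for illustration:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "you", "and"}

def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, drop stop words, stem, and count occurrences."""
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(crude_stem(w) for w in tokens)

print(bag_of_words("the Lord of the rings"))
# Counter({'lord': 1, 'ring': 1})
```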
Web Search Basics

[Diagram: the user issues a search query. A Web crawler traverses the Web and feeds an indexer, which builds the indexes used to answer searches. The results page mixes algorithmic Web results (e.g., "Results 1 - 10 of about 7,310,000 for miele (0.12 seconds)", listing miele.com, miele.co.uk, miele.de, miele.at) with sponsored links drawn from separate ad indexes.]
Crawlers

- A robot (spider) is a program that traverses the hypertext structure of the Web and collects information from visited pages.
- The page (or pages) that the crawler starts with are referred to as the seed URLs.
- Crawlers are used to construct indexes for search engines.
- Periodic crawler: visits a certain number of pages and then stops, builds an index, and replaces the existing index. It is activated periodically and searches portions of the Web (e.g., notices and circulars).
- Incremental crawler: selectively searches the Web and only updates the index incrementally, as opposed to replacing it.
- Focused crawler: visits pages related to a particular subject (topic of interest).
Focused Crawler

- Only visits links from a page if that page is determined to be relevant.
- A major piece of the architecture is a hypertext classifier that associates a relevance score with each document with respect to the crawl topic.
- A distiller is used to determine which pages contain links to many relevant pages; these are called hub pages.
- Hub pages may not contain the relevant information themselves, but they are quite important for facilitating continued searching.
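The crawl loop can be sketched as a best-first search over a priority frontier. The `fetch`, `links`, and `relevance` functions are caller-supplied placeholders (in a real focused crawler the relevance function would be the hypertext classifier described above), and the scoring policy here is a simplification:

```python
import heapq

def focused_crawl(seed_urls, fetch, links, relevance, budget=100, threshold=0.5):
    """Best-first focused-crawl sketch.

    fetch(url) -> page, links(page) -> outgoing URLs,
    relevance(page) -> score in [0, 1] w.r.t. the crawl topic.
    Only pages judged relevant have their outgoing links expanded."""
    # Max-heap via negated scores; seeds start with perfect relevance.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    visited, collected = set(), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        score = relevance(page)
        if score >= threshold:
            collected.append(url)
            # Expand links only from relevant pages, prioritized by score.
            for out in links(page):
                if out not in visited:
                    heapq.heappush(frontier, (-score, out))
    return collected
```

On a toy three-page graph where pages "a" and "c" match the topic and "b" does not, the crawl collects "a" and "c" and never expands the irrelevant page "b".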
Intelligent Search Agents

- Locating documents and services on the Web: WebCrawler and AltaVista (http://www.altavista.com) scan millions of Web documents and create an index of words.
- Web robots: software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information.
- ShopBot: retrieves product information from a variety of vendor sites using only general information about the product domain.
Web Outlier Mining

Web outliers are data objects that show significantly different characteristics from other Web data. Web outlier mining is the discovery and analysis of rare and interesting patterns from the Web.
Web Structure Mining

- Interested in the structure of the hyperlinks within the Web.
- Inspired by the study of social networks and citation analysis.
- Can discover specific types of pages (such as hubs and authorities) based on their incoming and outgoing links.
- Applications: discovering micro-communities in the Web; measuring the "completeness" of a Web site.

There are two approaches:
- PageRank: for discovering the most important pages on the Web (as used in Google).
- Hubs and authorities: a more detailed evaluation of the importance of Web pages.

The basic definition of importance: a page is important if important pages link to it.
- (1970s) Researchers proposed methods of using citations among journal articles to evaluate the quality of research papers.
- Unlike journal citations, Web linkage has some unique features:
  - One authority page will seldom have its Web page point to its competing authorities (Coca-Cola -> Pepsi).
  - Authoritative pages are seldom self-descriptive (Yahoo! may not contain the description "Web search engine").
Predecessors and Successors of a Web Page

[Diagram: a page's predecessors are the pages that link to it; its successors are the pages it links to.]
Link Structure Analysis

- Improves information retrieval by scoring Web pages according to their importance in the Web, or in a thematic sub-domain of it.
- Nodes with large fan-in (authorities) provide high-quality information.
- Nodes with large fan-out (hubs) are good starting points.
PageRank

A simple solution: create a stochastic matrix M of the Web:
- Each page i corresponds to row i and column i of the matrix.
- If page j has n successors (links), then entry M[i][j] is 1/n if page i is one of those n successors of page j, and 0 otherwise.
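Under that definition, the matrix can be built directly from an adjacency list. The link structure below is chosen to match the three-page worked example in these slides:

```python
def stochastic_matrix(links, pages):
    """Build the column-stochastic Web matrix: column j distributes
    page j's importance equally among its successors."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    M = [[0.0] * n for _ in range(n)]
    for j, page in enumerate(pages):
        succ = links[page]
        for s in succ:
            M[index[s]][j] = 1.0 / len(succ)
    return M

# Link structure consistent with the example matrix used later:
# A -> {A, B}, B -> {A, C}, C -> {B}
links = {"A": ["A", "B"], "B": ["A", "C"], "C": ["B"]}
M = stochastic_matrix(links, ["A", "B", "C"])
print(M)
# [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]
```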
The intuition behind this matrix:
- Initially, each page has 1 unit of importance.
- At each round, each page shares the importance it has among its successors, and receives new importance from its predecessors.
- The importance of each page reaches a limit after some steps.
PageRank: Example

Assume that the Web consists of only three pages: A, B, and C. The links among these pages give the transition matrix below (column X shows how page X's importance is split among its successors). Let [a, b, c] be the vector of importances for these three pages.

      A    B    C
A    1/2  1/2   0
B    1/2   0    1
C     0   1/2   0
The equation describing the asymptotic values of these three variables is:

[a]   [1/2  1/2   0] [a]
[b] = [1/2   0    1] [b]
[c]   [ 0   1/2   0] [c]

We can solve equations like this one by starting with the assumption a = b = c = 1 and applying the matrix to the current estimate of these values repeatedly. The first four iterations give the following estimates:

a = 1    1    5/4   9/8    5/4  ... 6/5
b = 1   3/2    1   11/8  17/16  ... 6/5
c = 1   1/2   3/4   1/2  11/16  ... 3/5
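The repeated application of the matrix is a power iteration. A minimal sketch that reproduces the estimates above:

```python
def power_iterate(M, steps=100):
    """Repeatedly apply the transition matrix to the importance vector,
    starting from all-ones, as in the worked example above."""
    n = len(M)
    v = [1.0] * n
    for _ in range(steps):
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
print(power_iterate(M))  # approaches [6/5, 6/5, 3/5]
```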
In the limit, the solution is a = b = 6/5 and c = 3/5. That is, a and b each have the same importance, twice that of c.

Problems with Real Web Graphs

- Dead ends: a page that has no successors has nowhere to send its importance. Pages with no outlinks are "dead ends" for the random surfer: there is nowhere to go on the next step.
Dead Ends: PageRank

Assume now that the structure of the Web has changed: C no longer links to any page, so the new matrix describing transitions is:

[a]   [1/2  1/2  0] [a]
[b] = [1/2   0   0] [b]
[c]   [ 0   1/2  0] [c]

The first four steps of the iterative solution are:
a = 1    1    3/4   5/8   1/2
b = 1   1/2   1/2   3/8  5/16
c = 1   1/2   1/4   1/4  3/16
Eventually, a, b, and c all become 0.
Spider Traps

- A group of pages is a spider trap if there are no links from within the group to outside the group.
- Spider traps: a group of one or more pages that have no links out. The random surfer gets trapped, and the group accumulates all the importance.
Spider Traps: Example

Assume now once more that the structure of the Web has changed: C links only to itself, so the new matrix describing transitions is:

[a]   [1/2  1/2  0] [a]
[b] = [1/2   0   0] [b]
[c]   [ 0   1/2  1] [c]

The first four steps of the iterative solution are:
a = 1    1    3/4   5/8    1/2
b = 1   1/2   1/2   3/8   5/16
c = 1   3/2   7/4    2   35/16
c converges to 3, while a and b converge to 0.
Google's Solution

Instead of applying the matrix directly, "tax" each page some fraction of its current importance, and distribute the taxed importance equally among all pages. For example, with a 20% tax (so each of the 3 pages receives 0.2 units of the taxed importance per round), the equations of the previous example become:

a = 0.8 * (1/2*a + 1/2*b + 0*c) + 0.2
b = 0.8 * (1/2*a + 0*b  + 0*c) + 0.2
c = 0.8 * (0*a  + 1/2*b + 1*c) + 0.2

The solution to these equations is a = 7/11, b = 5/11, c = 21/11.
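The taxed iteration can be sketched as follows; applied to the spider-trap matrix, it reproduces the 7/11, 5/11, 21/11 solution instead of letting C absorb everything:

```python
def taxed_pagerank(M, tax=0.2, steps=100):
    """PageRank with the 'taxation' trick: keep (1 - tax) of the
    matrix-distributed importance and hand every page an equal share
    of the taxed importance. The share is fixed at tax * n / n = tax
    units per page, matching the +0.2 terms in the equations above."""
    n = len(M)
    v = [1.0] * n
    share = tax * sum(v) / n  # 0.2 for the 3-page example
    for _ in range(steps):
        v = [(1 - tax) * sum(M[i][j] * v[j] for j in range(n)) + share
             for i in range(n)]
    return v

# The spider-trap example: C links only to itself.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
print(taxed_pagerank(M))  # approaches [7/11, 5/11, 21/11]
```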
HITS (Hypertext-Induced Topic Selection)

Kleinberg's hypertext-induced topic selection (HITS) algorithm was also developed for ranking documents based on the link information among a set of documents.

Authorities and Hubs

The algorithm identifies two types of pages:
- Authority: a page that provides important, trustworthy information on a given topic.
- Hub: a page that contains links to authorities.
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
Authorities and Hubs (2)

[Diagram: pages 2, 3, and 4 point to page 1, so page 1's authority weight sums their hub weights; page 1 points to pages 5, 6, and 7, so its hub weight sums their authority weights.]

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)
HITS Algorithm

- Hubs point to lots of authorities.
- Authorities are pointed to by lots of hubs.
- Together they form a bipartite graph: hubs on one side, authorities on the other.
Example of Hubs and Authorities

The adjacency matrix A of a set of pages (nodes) defines the linking structure: element a_ij is 1 if node i references node j, and 0 otherwise.

    [0 0 1]          [0 0 0]
A = [0 0 1]    A^T = [0 0 0]
    [0 0 0]          [1 1 0]
Hubs and Authorities

- The weights for authority nodes can be determined from the hub weights: a = A^T . h
- And similarly the hub weights from the authority weights: h = A . a
- Substituting each equation into the other, we get:
  a = A^T . A . a
  h = A . A^T . h
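The update equations above can be iterated directly. A minimal sketch, with the standard normalization step added so the weights stay bounded (the slides leave normalization implicit):

```python
def hits(adj, steps=20):
    """Iterate a = A^T . h and h = A . a over an adjacency matrix,
    normalizing each round so the largest weight is 1."""
    n = len(adj)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(steps):
        # a = A^T . h : page j's authority sums hub weights of pages linking to it
        auth = [sum(adj[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # h = A . a : page i's hub weight sums authority weights of pages it links to
        hub = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
        a_max = max(auth) or 1.0
        h_max = max(hub) or 1.0
        auth = [x / a_max for x in auth]
        hub = [x / h_max for x in hub]
    return hub, auth

# The three-node example above: nodes 1 and 2 both link to node 3.
A = [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]]
hub, auth = hits(A)
print(hub, auth)  # nodes 1 and 2 are pure hubs; node 3 is the authority
```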
Example (continued)

Assume the initial hub weight vector is h = [1, 1, 1]^T. We compute the authority weight vector by:

              [0 0 0]   [1]   [0]
a = A^T . h = [0 0 0] . [1] = [0]
              [1 1 0]   [1]   [2]

The hub weights are then updated with the help of h = A . a:

            [0 0 1]   [0]   [2]
h = A . a = [0 0 1] . [0] = [2]
            [0 0 0]   [2]   [0]

Once we have the hub weights h and authority weights a, we keep updating them with:
a = A^T . A . a
h = A . A^T . h
HITS vs. PageRank

- HITS emphasizes the mutual reinforcement between authority and hub pages, while PageRank does not attempt to capture the distinction between hubs and authorities: it ranks pages by authority alone.
- HITS is applied to the local neighborhood of pages surrounding the results of a query, whereas PageRank is applied to the entire Web.
- HITS is query-dependent, but PageRank is query-independent.
- Both HITS and PageRank correspond to matrix computations.
- Both can be unstable: changing a few links can lead to quite different rankings.
- PageRank doesn't handle pages with no out-edges very well, because they decrease the overall PageRank.
Web Usage Mining

- A Web site is a collection of inter-related files on one or more Web servers.
- Web usage mining: the discovery of meaningful patterns from data generated by client-server transactions.
- Typical sources of data:
  - Automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies.
  - User profiles.
- Tries to predict user behavior from interaction with the Web.
- Uses a wide range of data (logs): Web client data, proxy server data, Web server data.
- Two common approaches:
  - Map the usage data of the Web server into relational tables before applying adapted data mining techniques.
  - Use the log data directly, with special pre-processing techniques.

Data Sources: Logs

Every time a user visits the Web site, the server records the click flow (clickstream) in the log file.
Sample Web Log

2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 GET /STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 HTTP/1.0 www.mcu.ac.in Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - http://www.mcu.ac.in/search_a_study_institute.htm

Field by field:
- date (2006-10-04), time (04:24:39)
- client-side IP (203.200.95.194)
- server site name (W3SVC195), server computer name (NS), server IP (69.41.233.13), server port (80)
- method (GET), client URL (/STUDYCENTRE/StudyCentreSearchPage.asp)
- server status code (200), client-side bytes received (568), time taken (210)
- user agent (Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;))
- cs_referer (http://www.mcu.ac.in/search_a_study_institute.htm)
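A sketch of parsing such a line by splitting on whitespace. The field positions are read off this particular sample entry and would need adjusting for a server configured with different log fields:

```python
def parse_log_line(line):
    """Parse one space-delimited extended-log entry into a dict.
    Field positions match the sample entry above (an assumption)."""
    f = line.split()
    return {
        "date": f[0], "time": f[1], "client_ip": f[2],
        "site": f[4], "server": f[5], "server_ip": f[6], "port": f[7],
        "method": f[8], "url": f[9],
        "status": int(f[11]), "bytes_received": int(f[14]),
        "time_taken_ms": int(f[15]),
        "user_agent": f[18], "referer": f[20],
    }

sample = ("2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 "
          "GET /STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 "
          "HTTP/1.0 www.mcu.ac.in "
          "Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - "
          "http://www.mcu.ac.in/search_a_study_institute.htm")
rec = parse_log_line(sample)
print(rec["client_ip"], rec["status"], rec["time_taken_ms"])
# 203.200.95.194 200 210
```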
Typical problems:
- Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers.
- Usage mining often uses background or domain knowledge, e.g., site topology, Web content, etc.
Web Usage Mining: Applications

Two main categories:
- Learning a user profile (personalized): Web users would be interested in techniques that learn their needs and preferences automatically.
- Learning user navigation patterns (impersonalized): information providers would be interested in techniques that improve the effectiveness of their Web sites.

Applications include:
- Personalization
- Improving the structure of a site's Web pages
- Aiding caching and prediction of future page references
- Improving the design of individual pages
- Improving the effectiveness of e-commerce (sales and advertising)
Web Usage Mining (WUM)

The discovery of interesting user access patterns from Web server logs.

Generate simple statistical reports:
- A summary report of hits and bytes transferred
- A list of top requested URLs
- A list of top referrers
- A list of the most common browsers used
- Hits per hour/day/week/month reports
- Hits per domain reports

Learn:
- Who is visiting your site
- The path visitors take through your pages
- How much time visitors spend on each page
- The most common starting page
- Where visitors are leaving your site
Web Usage Mining: Three Phases

Raw server log -> [Pre-processing] -> user session file -> [Pattern discovery] -> rules and patterns -> [Pattern analysis] -> interesting knowledge
Web Log Mining

- The server log is filtered (parsed).
- Data cleaning routines attempt to deal with missing values, smooth out noise, and identify outliers.
- Data transformation: into the forms most appropriate for Web usage mining.
- Web mining techniques such as association rules, clustering, and classification are then applied.
Summary Reports

- Request summary: request statistics for all modules/pages/files.
- Domain summary: request statistics from different domains.
- Event summary: statistics of the occurrence of all events/actions.
- Session summary: statistics of sessions.
- Bandwidth summary: statistics of generated network traffic.
- Error summary: statistics of all error messages.
- Referring organization summary: statistics of where the users came from.
- Agent summary: statistics of the use of different browsers, etc.
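Two of these reports (top requested URLs and a status-code breakdown) reduce to simple counting over parsed log records. The record shape below, a dict with "url" and "status" keys, is an assumption matching the sample log fields:

```python
from collections import Counter

def summarize(records):
    """Produce a top-URLs report and a hits-per-status report from
    parsed log records (dicts with 'url' and 'status' keys)."""
    urls = Counter(r["url"] for r in records)
    status = Counter(r["status"] for r in records)
    return urls.most_common(5), status

recs = [{"url": "/a", "status": 200}, {"url": "/a", "status": 200},
        {"url": "/b", "status": 404}]
print(summarize(recs))
# ([('/a', 2), ('/b', 1)], Counter({200: 2, 404: 1}))
```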
Personalized Information Access

[Diagram: information sources pass through a personalization server to the receivers.]

Personalization vs. Intelligence

- Better service for the user:
  - Reduction of information overload.
  - More accurate information retrieval and extraction.
  - Recommendation and guidance.
- Customer relationship management:
  - Customer segmentation and targeted advertisement.
  - Customer attraction and retention.
  - Service improvement (site structure and content).
User Modeling

Basic elements:
- Constructing models that can be used to adapt the system to the user's requirements.
- Different types of requirement: interests (sports and finance news), knowledge level (novice vs. expert), preferences (no-frame GUI), etc.
- Different types of model: personal vs. generic.
- Knowledge discovery facilitates the acquisition of user models from data.

Examples of user models:
- User model (type A) [PERSONAL]: User x -> sports, stock market
- User model (type B) [PERSONAL]: User x, Age 26, Male -> sports, stock market
- User community [GENERIC]: Users {x, y, z} -> sports, stock market
- User stereotype [GENERIC]: Users {x, y, z}, Age [20..30], Male -> sports, stock market
Learning User Models

[Diagram: observation of users 1-5 interacting with the system yields user models, which are grouped into user communities (Community 1, Community 2).]

Knowledge discovery process:
- Data collection: collection of usage data by the server and the client.
- Data pre-processing: data cleaning, user identification, session identification.
- Pattern discovery: construction of user models.
- Knowledge post-processing: report generation, visualization, personalization module.
Usage Data Sources

- Server-side data: access logs in Common or Extended Log Format, user queries, cookies.
- Client-side data: Java and JavaScript agents.
- Intermediary data: proxy logs, packet sniffers.
- Registration forms: personal information and preferences supplied by the user.
- Demographic information: provided by census databases.
Pre-processing Usage Data

- Cleaning:
  - Log entries that correspond to error responses.
  - Trails of robots.
  - Pages that have not been requested explicitly by the user (mainly image files, loaded automatically). This step should be domain-specific.
- User identification:
  - Identification by log-in.
  - Cookies and JavaScript.
  - Extended Log Format (browser and OS version).
  - Bookmarking a user-specific URL.
- User session/transaction identification in log files:
  - Time-based methods, e.g., a 30-minute silence interval. There are problems with caching; partial solutions include special HTTP headers and Java agents.
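The 30-minute-silence heuristic can be sketched as follows, assuming one user's requests are already sorted by timestamp:

```python
from datetime import datetime, timedelta

def sessionize(requests, gap_minutes=30):
    """Split one user's time-sorted (timestamp, url) requests into
    sessions: a silence longer than gap_minutes starts a new session."""
    gap = timedelta(minutes=gap_minutes)
    sessions = []
    for ts, url in requests:
        if sessions and ts - sessions[-1][-1][0] <= gap:
            sessions[-1].append((ts, url))  # continue current session
        else:
            sessions.append([(ts, url)])    # start a new session
    return sessions

reqs = [(datetime(2006, 10, 4, 4, 24), "/index.htm"),
        (datetime(2006, 10, 4, 4, 30), "/search.asp"),
        (datetime(2006, 10, 4, 6, 0), "/index.htm")]
print(len(sessionize(reqs)))  # 2: the 6:00 request starts a new session
```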
Collection and Preparation

Problems:
- Privacy and security issues: the user must be aware of the data collected; cookies and client-side agents are often disabled.
- Caching on the client or an intermediate proxy causes data loss on the server side.
- Registration forms are a nuisance and are not reliable sources.
- User and session identification from server logs is hard.
- Different data are required for different user models.
Personalized Assistants

Personalized crawling [Lieberman et al., ACM Communications, 2000]:
- The system knows the user (log-in).
- It uses heuristics to extract "important" terms from the Web pages that the user visits and adds them to thematic profiles.
- Each time the user views a page, the system:
  - searches the Web for related pages,
  - filters them according to the relevant thematic profile,
  - and constructs a list of recommended links for the user.
- The Letizia version of the system searches the Web locally, following outgoing links from the current page.
- The Powerscout version uses a search engine to explore the Web.