BIDM

7/29/2019 BIDM


Business Intelligence and Data Mining (BI & DM)

Text Book:
Business Intelligence: A Managerial Approach by Efraim Turban, Ramesh Sharda, Dursun Delen and David King, 2/e, Pearson, 2012

Reference Material:
Decision Support and Business Intelligence Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012


Sessions Plan

Introduction to Business Intelligence
Decision Support Systems Concepts, Methodologies and Technologies
Data Warehousing
Business Performance Management
Data Mining for Business Intelligence
Text and Web Mining
Business Intelligence: Implementation and Emerging Trends



Introduction to Text and Web Mining



Learning Objectives

Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining
  Web content mining
  Web structure mining
  Web usage mining
Understand the applications of these three mining paradigms


Opening Vignette

Mining Text for Security and Counterterrorism
  What is MITRE?
  Problem description
  Proposed solution
  Results
  Answer & discuss the case questions


What is Text Mining?

finding interesting regularities in large textual datasets (adapted from Usama Fayyad)
  where interesting means: non-trivial, hidden, previously unknown and potentially useful
finding semantic and abstract information from the surface form of textual data


Semi-Structured Data

Text databases are, in general, semi-structured
Example:
  Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
  Unstructured: Abstract, Content


Text Mining Process

Text preprocessing
  Syntactic/semantic text analysis
Features generation
  Bag of words
Features selection
  Simple counting
  Statistics
Text/data mining
  Classification
  Clustering
  Associations
Analyzing results


Levels of text representations

Character (character n-grams and sequences)

Words (stop-words, stemming, lemmatization)

Phrases (word n-grams, proximity features)

Part-of-speech tags

Taxonomies / thesauri

Vector-space model

Language models

Full-parsing

Cross-modality

Collaborative tagging / Web2.0

Templates / Frames

Ontologies / First order theories


Character level

Character level representation of a text consists of sequences of characters
  a document is represented by a frequency distribution of sequences
  usually we deal with contiguous strings
  each character sequence of length 1, 2, 3, ... represents a feature with its frequency
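
The character-level representation can be sketched in a few lines of Python. This is only an illustration: the function name `char_ngram_counts` and the cutoff `max_n=3` are assumptions chosen for the example, not part of the slides.

```python
from collections import Counter

def char_ngram_counts(text, max_n=3):
    """Count every contiguous character sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts

# Each distinct sequence is one feature; its count is the feature value.
profile = char_ngram_counts("banana")
print(profile["a"], profile["an"], profile["ana"])  # prints: 3 2 2
```

Two documents can then be compared through their frequency profiles instead of their raw text.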


Word level

The most common representation of text, used for many techniques
  there are many tokenization software packages which split text into words
Important to know:
  A word is a well-defined unit in Western languages
  e.g. Chinese has a different notion of semantic unit


Word Properties

Relations among word surface forms and their senses:
  Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts follow a power-law distribution:
  a small number of very frequent words
  a large number of low-frequency words


Phrase level

Instead of having just single words we can deal with phrases
We use two types of phrases:
  Phrases as frequent contiguous word sequences
  Phrases as frequent non-contiguous word sequences
  both types of phrases can be identified by a simple dynamic programming algorithm
The main effect of using phrases is to identify sense more precisely


Part-of-Speech level

By introducing part-of-speech tags we introduce word types, which make it possible to differentiate word functions
  For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in e.g. named entities, which are noun phrases
  Another possible use is reduction of the vocabulary (features)
    it is known that nouns carry most of the information in text documents
Part-of-speech taggers are usually learned by an HMM algorithm on manually tagged data


Part-of-Speech Table

http://www.englishclub.com/grammar/parts-of-speech_1.htm


WordNet: database of lexical relations

WordNet is the most well-developed and widely used lexical database for English
  it consists of 4 databases (nouns, verbs, adjectives, and adverbs)
Each database consists of sense entries; each sense consists of a set of synonyms, e.g.:
  musician, instrumentalist, player
  person, individual, someone
  life form, organism, being

Category    Unique Forms   Number of Senses
Noun        94474          116317
Verb        10319          22066
Adjective   20170          29881
Adverb      4546           5677


WordNet relations

Each WordNet entry is connected with other entries in the graph through relations
Relations in the database of nouns:

Relation     Definition                       Example
Hypernym     From lower to higher concepts    breakfast -> meal
Hyponym      From concepts to subordinates    meal -> lunch
Has-Member   From groups to their members     faculty -> professor
Member-Of    From members to their groups     copilot -> crew
Has-Part     From wholes to parts             table -> leg
Part-Of      From parts to wholes             course -> meal
Antonym      Opposites                        leader -> follower


Document Representation

A document representation aims to capture what the document is about
One possible approach:
  Each entry describes a document
  Attributes describe whether or not a term appears in the document

             Term
             Camera   Digital   Memory   Pixel
Document 1   1        1         0        1
Document 2   1        1         0        0
...          ...      ...       ...      ...
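
A minimal sketch of how such a binary table could be produced. The two sample texts are invented for illustration (the slides show only the resulting rows), and whitespace tokenization is a simplifying assumption.

```python
def binary_term_document_matrix(documents, vocabulary):
    """Return one row per document: 1 if the term occurs in it, else 0."""
    rows = []
    for text in documents:
        tokens = set(text.lower().split())  # naive whitespace tokenizer
        rows.append([1 if term in tokens else 0 for term in vocabulary])
    return rows

vocabulary = ["camera", "digital", "memory", "pixel"]
# Hypothetical documents, chosen so the rows match the table above.
documents = [
    "Digital camera with a large pixel size",
    "Compact digital camera",
]
matrix = binary_term_document_matrix(documents, vocabulary)
print(matrix)  # prints: [[1, 1, 0, 1], [1, 1, 0, 0]]
```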


Distance Based Matching

In order to retrieve documents similar to a given document, one needs a measure of similarity
Euclidean distance
  The Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

  $$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
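
As a small worked example, the Euclidean distance can be computed directly from its definition. The two vectors reuse the Document 1 and Document 2 rows of the Document Representation table; the function name is an assumption of this sketch.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

doc1 = [1, 1, 0, 1]  # Camera, Digital, Memory, Pixel
doc2 = [1, 1, 0, 0]
distance = euclidean_distance(doc1, doc2)
print(distance)  # prints: 1.0
```

The two documents differ only in the Pixel term, so the distance is 1.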


Similarity between document vectors

Each document is represented as a vector of weights D = (x_1, x_2, ..., x_n)
Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors
  calculates the cosine of the angle between document vectors
  efficient to calculate (sum of products of intersecting words)
  similarity value between 0 (different) and 1 (the same)

$$Sim(D_1, D_2) = \frac{\sum_i x_{1i}\, x_{2i}}{\sqrt{\sum_j x_{1j}^2}\; \sqrt{\sum_k x_{2k}^2}}$$
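
The cosine measure can be sketched as follows. Returning 0.0 for an all-zero vector is a convention chosen for this example, not something the slides specify.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm1 * norm2)

doc1 = [1, 1, 0, 1]
doc2 = [1, 1, 0, 0]
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 4))  # prints: 0.8165
```

Only terms present in both documents contribute to the numerator, which is why the measure is cheap to compute on sparse vectors.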


Performance Measure

The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures

$$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$

$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$

[Figure: Venn diagram of the Relevant Documents and Retrieved Documents sets; their overlap is Relevant & Retrieved]
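
A short sketch of both measures on sets of document IDs. The IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]        # hypothetical search result
relevant = ["d2", "d4", "d5", "d6", "d7"]   # hypothetical ground truth
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # prints: 0.5 0.4
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).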


Cluster Analysis for Data Mining

Analysis methods
  Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
  Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
  Fuzzy logic (e.g., fuzzy c-means algorithm)
  Genetic algorithms
Divisive versus agglomerative methods


Text Mining for Patent Analysis

(see Applications Case 7.2)

What is a patent?

exclusive rights granted by a country to an

inventor for a limited period of time in exchange

for a disclosure of an invention

How do we do patent analysis (PA)?

Why do we need to do PA?

What are the benefits?

What are the challenges?

How does text mining help in PA?


Natural Language Processing (NLP)

Structuring a collection of text
  Old approach: bag-of-words
  New approach: natural language processing
NLP is
  a very important concept in text mining
  a subfield of artificial intelligence and computational linguistics
  the study of "understanding" the natural human language
Syntax versus semantics based text mining



Challenges in NLP
  Part-of-speech tagging
  Text segmentation
  Word sense disambiguation
  Syntax ambiguity
  Imperfect or irregular input
  Speech acts
Dream of AI community
  to have algorithms that are capable of automatically reading and obtaining knowledge from text



WordNet
  A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
  A major resource for NLP
  Needs automation to be completed
Sentiment Analysis
  A technique used to detect favorable and unfavorable opinions toward specific products and services
  See Application Case 7.3 for a CRM application


NLP Task Categories

Information retrieval

Information extraction

Named-entity recognition

Question answering

Automatic summarization

Natural language generation & understanding

Machine translation

Foreign language reading & writing

Speech recognition

Text proofing

Optical character recognition


Text Mining Applications

Marketing applications
  Enables better CRM
Security applications
  ECHELON, OASIS
  Deception detection: example coming up
Medicine and biology
  Literature-based gene identification: example coming up
Academic applications
  Research stream analysis: example coming up


Text Mining Applications (gene/protein interaction identification)

[Figure: the sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." annotated at five levels: gene/protein identifiers, ontology codes, word identifiers, part-of-speech (POS) tags, and shallow parse (NP/PP chunks)]


Text Mining Process

The three-step text mining process (with feedback loops between the tasks)

Inputs: a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

Task 1. Establish the Corpus: collect and organize the domain-specific unstructured data
  Output: a collection of documents in some digitized format for computer processing

Task 2. Create the Term-Document Matrix: introduce structure to the corpus
  Output: a flat file called the term-document matrix, where the cells are populated with the term frequencies

Task 3. Extract Knowledge: discover novel patterns from the T-D matrix
  Output: a number of problem-specific classification, association, and clustering models and visualizations


Text Mining Process

Step 1: Establish the corpus

Collect all relevant unstructured data (e.g.,

textual documents, XML files, emails, Web

pages, short notes, voice recordings)

Digitize, standardize the collection (e.g.,

all in ASCII text files)

Place the collection in a common place (e.g.,

in a flat file, or in a directory as separate files)


Text Mining Process

Step 2: Create the Term-by-Document Matrix

[Figure: example term-by-document matrix. Rows: Document 1 through Document 6, ... Columns (terms): investment risk, project management, software engineering, development, SAP, ... Cells hold term frequencies; the sample values shown are 1, 2, and 3.]


Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)
  Should all terms be included?
    Stop words, include words
    Synonyms, homonyms
    Stemming
  What is the best representation of the indices (values in cells)?
    Raw counts; binary frequencies; log frequencies; inverse document frequency
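
The cell representations listed above can be compared on a tiny matrix. The slides do not give formulas, so this sketch assumes two common textbook choices: log frequency as 1 + ln(count) and inverse document frequency as ln(N / document frequency). The function name and sample matrix are likewise illustrative.

```python
import math

def weight_variants(tdm):
    """tdm: rows are documents, columns are terms, cells are raw counts."""
    n_docs, n_terms = len(tdm), len(tdm[0])
    binary = [[1 if c > 0 else 0 for c in row] for row in tdm]
    # Assumed log-frequency form: 1 + ln(count), 0 when the term is absent.
    log_freq = [[1 + math.log(c) if c > 0 else 0.0 for c in row] for row in tdm]
    # Document frequency: in how many documents each term occurs.
    doc_freq = [sum(binary[d][t] for d in range(n_docs)) for t in range(n_terms)]
    # Assumed IDF form: ln(N / df); terms found in every document get 0.
    idf = [math.log(n_docs / df) if df else 0.0 for df in doc_freq]
    tf_idf = [[c * idf[t] for t, c in enumerate(row)] for row in tdm]
    return binary, log_freq, idf, tf_idf

raw_counts = [
    [1, 0, 2],
    [1, 3, 0],
    [1, 0, 0],
]
binary, log_freq, idf, tf_idf = weight_variants(raw_counts)
print(binary)
print(idf)
```

The first term occurs in every document, so its IDF is 0 and it no longer helps to tell documents apart, while the rarer terms are weighted up.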


Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)
  TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
    Manual: a domain expert goes through it
    Eliminate terms with very few occurrences in very few documents (?)
    Transform the matrix using singular value decomposition (SVD)
    SVD is similar to principal component analysis


Text Mining Process

Step 3: Extract patterns/knowledge
  Classification (text categorization)
  Clustering (natural groupings of text)
    Improve search recall
    Improve search precision
    Scatter/gather
    Query-specific clustering
  Association
  Trend analysis


Web Mining
  The term was coined by Oren Etzioni (1996)
  Application of data mining techniques to automatically discover and extract information from Web data


What is Web Mining?

Discovering useful information from the

World-Wide Web and its usage patterns


Web Mining v. Data Mining

Structure (or lack of it)

Textual information and linkage structure

Scale

Data generated per day is comparable to largest

conventional data warehouses

Speed

Often need to react to evolving usage patterns in real time (e.g., merchandising)


Web Mining topics

Web graph analysis

Power Laws and The Long Tail

Structured data extraction

Web advertising

Systems Issues


Size of the Web

Number of pages
  Technically, infinite
  Much duplication (30-40%)
  Best estimate of unique static HTML pages comes from search engine claims
    Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    Google recently announced that their index contains 1 trillion pages
    How to explain the discrepancy?


The web as a graph

Pages = nodes, hyperlinks = edges

Ignore content

Directed graph

High linkage

10-20 links/page on average

Power-law degree distribution


Structure of Web graph

Let's take a closer look at structure
  Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
  Bow-tie structure
  Not a small world


Bow-tie Structure

Source: Broder et al, 2000


Power-law degree distribution

Source: Broder et al, 2000


Power-laws galore

Structure

In-degrees

Out-degrees

Number of pages per site

Usage patterns

Number of visitors

Popularity e.g., products, movies, music


The Long Tail

Source: Chris Anderson (2004)


Extracting Structured Data

http://www.simplyhired.com


Extracting structured data

http://www.fatlens.com


Ads vs. search results

Search advertising is the revenue model

Multi-billion-dollar industry

Advertisers pay for clicks on their ads

Interesting problems

What ads to show for a search?

If I'm an advertiser, which search terms should I bid on, and how much should I bid?


Two Approaches to Analyzing Data

Machine Learning approach

Emphasizes sophisticated algorithms e.g., Support

Vector Machines

Data sets tend to be small, fit in memory

Data Mining approach

Emphasizes big data sets (e.g., in the terabytes)

Data cannot even fit on a single disk!

Necessarily leads to simpler algorithms


Systems architecture

[Figure: a single machine with CPU, Memory, and Disk. Machine learning and statistics work on data in memory; classical data mining works on data on disk.]


Very Large-Scale Data Mining

[Figure: a cluster of commodity nodes, each with its own CPU, memory, and disk]


Systems Issues

Web data sets can be very large

Tens to hundreds of terabytes

Cannot mine on a single server!

Need large farms of servers

How to organize hardware/software to mine multi-terabyte data sets
  Without breaking the bank!


Project

Lots of interesting project ideas
  If you can't think of one, please come discuss with us
Infrastructure
  Aster Data cluster on Amazon EC2
  Supports both MapReduce and SQL
Data
  Netflix
  ShareThis
  Google WebBase
  TREC


Data Mining vs. Web Mining

Traditional data mining data is structured and relational

well-defined tables, columns, rows, keys, and

constraints.

Web data

Semi-structured and unstructured

readily available data rich in features and patterns


Web Data

Web Structure

Hyperlink tag: "Click here to Shop Online" (http://www.walmart.com/)


Web Data

Web Usage

Application Server logs

Http logs


Web Data

Web Content


Web Mining Categories

Web Content Mining

Discovering useful information from web

contents/data/documents.

Web Structure Mining

Discovering the model underlying link structures (topology)

on the Web. E.g. discovering authorities and hubs

Web Usage Mining

Make sense of data generated by surfers

Usage data from logs, user profiles, user sessions, cookies,

user queries, bookmarks, mouse clicks and scrolls, etc.


Web Content Data Structure

Unstructured: free text
Semi-structured: HTML
More structured: table or database-generated HTML pages
Multimedia data receive less attention than text or hypertext


Web Content Mining

Process of information or resource discovery from content of millions of sources across the World Wide Web

E.g. Web data contents: text, Image, audio, video,

metadata and hyperlinks

Goes beyond key word extraction, or some simple

statistics of words and phrases in documents.


Web Content Mining

Pre-processing data before web content mining: feature selection (Piramuthu 2003)

Post-processing data can reduce ambiguous

searching results (Sigletos & Paliouras 2003)

Web Page Content Mining

Mines the contents of documents directly

Search Engine Mining

Improves on the content search of other tools like search

engines.


Web Content Mining

Web content mining is related to data mining and text mining [Bing Liu, 2005; http://www.cs.uic.edu/~liub]
  It is related to data mining because many data mining techniques can be applied in Web content mining.
  It is related to text mining because much of the Web content is text.
  Web data are mainly semi-structured and/or unstructured, while data mining deals with structured data and text mining with unstructured text.

Web Content Mining: IR View



Unstructured Documents

Bag of words, or phrase-based feature representation
Features can be boolean or frequency based
Features can be reduced using different feature selection techniques
Word stemming, combining morphological variations into one feature



Semi-Structured Documents
  Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  Uses common data mining methods (whereas unstructured might use more text mining methods)


Web Content Mining: DB View

Tries to infer the structure of a Web site or transform a Web site to become a database
  Better information management
  Better querying on the Web
Can be achieved by:
  Finding the schema of Web documents
  Building a Web warehouse
  Building a Web knowledge base
  Building a virtual database


Web-Structure Mining

Generate a structural summary about the Web site and Web page
  Depending upon the hyperlink, categorizing the Web pages and the related information at the inter-domain level
  Discovering the Web page structure
  Discovering the nature of the hierarchy of hyperlinks in the Web site and its structure

Web-Structure Mining (cont.)



Finding Information about web pages

Inference on Hyperlink

Retrieving information about the relevance and the quality

of the web page.

Finding the authoritative pages on the topic and content.

The web page contains not only information but also hyperlinks, which contain a huge amount of annotation.

A hyperlink identifies the author's endorsement of the other web page.




More Information on Web Structure Mining

Web Page Categorization. (Chakrabarti 1998)

Finding micro communities on the web

e.g. Google (Brin and Page, 1998)

Schema Discovery in Semi-Structured Environment.


Web Usage Mining

Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
  Web client data
  Proxy server data
  Web server data
Two common approaches
  Map usage data into relational tables before using adapted data mining techniques
  Use log data directly by utilizing special pre-processing techniques


Web Usage Mining

Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
Often usage mining uses some background or domain knowledge
  E.g. site topology, Web content, etc.


Web Usage Mining

Two main categories:
  Learning a user profile (personalized)
    Web users would be interested in techniques that learn their needs and preferences automatically
  Learning user navigation patterns (impersonalized)
    Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site


Web Usage Mining (cont.)

Data Mining Techniques: Navigation Patterns
Analysis example:
  70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1
  80% of users who accessed the site started from /company/products
  65% of users left the site after four or fewer page references




Data Mining Techniques: Sequential Patterns

Example: Supermarket (cont.)

Customer   Transaction Time     Purchased Items
John       6/21/05  5:30 pm     Beer
John       6/22/05 10:20 pm     Brandy
Frank      6/20/05 10:15 am     Juice, Coke
Frank      6/20/05 11:50 am     Beer
Frank      6/20/05 12:50 pm     Wine, Cider
Mary       6/20/05  2:30 pm     Beer
Mary       6/21/05  6:17 pm     Wine, Cider
Mary       6/22/05  5:05 pm     Brandy





Example: Supermarket (cont.)

Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)

Mining Result

Sequential Patterns with Support >= 40%   Supporting Customers
(Beer) (Brandy)                           John, Mary
(Beer) (Wine, Cider)                      Frank, Mary
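
The mining result can be checked with a small sketch. It only tests given candidate patterns against the three customer sequences; it is not a full sequential-pattern miner, and the helper names are assumptions of this example.

```python
def contains_sequence(customer_sequence, pattern):
    """True if the pattern's itemsets occur, in order, inside the sequence.

    Each pattern itemset must be a subset of a later transaction than the
    one that matched the previous pattern itemset.
    """
    position = 0
    for itemset in pattern:
        while (position < len(customer_sequence)
               and not itemset <= customer_sequence[position]):
            position += 1
        if position == len(customer_sequence):
            return False
        position += 1  # the next itemset must match a later transaction
    return True

sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}

def supporters(pattern):
    """Customers whose purchase sequence contains the pattern."""
    return sorted(name for name, seq in sequences.items()
                  if contains_sequence(seq, pattern))

for pattern in ([{"Beer"}, {"Brandy"}], [{"Beer"}, {"Wine", "Cider"}]):
    names = supporters(pattern)
    print(names, round(len(names) / len(sequences), 2))
```

Each pattern is supported by 2 of the 3 customers (about 67%), which clears the 40% support threshold.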





Web usage examples

In Google search, within the past week, 30% of users who visited /company/product/ had "camera" as the search text.

60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days


Tech for Web Content Mining

Classifications

Clustering

Association


Document Classification

Supervised Learning
  Supervised learning is a machine learning technique for creating a function from training data.
  Documents are categorized
  The output can predict a class label of the input object (called classification).
Techniques used are
  Nearest Neighbor Classifier
  Feature Selection
  Decision Tree


Feature Selection

Removes terms in the training documents which are statistically uncorrelated with the class labels
Simple heuristics
  Stop words like "a", "an", "the", etc.
  Empirically chosen thresholds for ignoring too frequent or too rare terms
  Discard too frequent and too rare terms
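
The simple heuristics can be sketched as a document-frequency filter. This covers only the stop-word and threshold heuristics, not the statistical test against class labels; the thresholds, the three-word stop list, and the sample documents are all assumptions made for illustration.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # tiny illustrative stop list

def select_terms(documents, min_df=2, max_df_ratio=0.7):
    """Keep terms that are neither stop words, too rare, nor too frequent."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))  # count each term once per document
    n_docs = len(documents)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS
        and df >= min_df                   # drop too-rare terms
        and df / n_docs <= max_df_ratio    # drop too-frequent terms
    )

documents = [
    "the camera has a zoom lens",
    "the camera has a flash",
    "an old camera lens",
    "the tripod has a flash",
]
selected = select_terms(documents)
print(selected)  # prints: ['flash', 'lens']
```

"camera" and "has" appear in 3 of 4 documents and are dropped as too frequent; terms seen in a single document are dropped as too rare.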


Document Clustering

Unsupervised learning: a data set of input objects is gathered
Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he or she is likely to be interested in other members of the cluster to which d/t belongs.
Hierarchical
  Bottom-up
  Top-down
Partitional

Semi-Supervised Learning


A collection of documents is available

A subset of the collection has known labels

Goal: to label the rest of the collection.

Approach
  Train a supervised learner using the labeled subset.
  Apply the trained learner on the remaining documents.
Idea
  Harness information in the labeled subset to enable better learning.

Also, check the collection for emergence of new topics


Association

Example: Supermarket

Transaction ID   Items Purchased
1                butter, bread, milk
2                bread, milk, beer, egg
3                diaper

An association rule can be:
  If a customer buys milk, in 50% of cases he/she also buys beer. This happens in 33% of all transactions.
  50%: confidence
  33%: support

Can also integrate in hyperlinks
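
The support and confidence figures for the milk-and-beer rule can be reproduced from the three transactions. The function names are illustrative.

```python
transactions = [
    {"butter", "bread", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"milk", "beer"})
rule_confidence = confidence({"milk"}, {"beer"})
print(round(rule_support, 2), round(rule_confidence, 2))  # prints: 0.33 0.5
```

Milk appears in 2 transactions and beer accompanies it in 1 of them, giving 50% confidence; that single transaction is 1 of 3 overall, giving 33% support.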


Q & A
