BIDM
-
Upload
awadhesh-yadav -
Transcript of BIDM
-
7/29/2019 BIDM
1/122
Business Intelligence and Data Mining (BI & DM)
Text Book:
Business Intelligence: A Managerial Approach by
Efraim Turban, Ramesh Sharda, Dursun Delen and
David King, 2/e, Pearson, 2012
Reference Material:
Decision Support and Business Intelligence
Systems by Efraim Turban, Ramesh Sharda and Dursun Delen, 9/e, Pearson, 2012
Sessions Plan
  Introduction to Business Intelligence
  Decision Support Systems Concepts, Methodologies and Technologies
  Data Warehousing
  Business Performance Management
  Data Mining for Business Intelligence
  Text and Web Mining
  Business Intelligence: Implementation and Emerging Trends
Introduction to Text and Web Mining
Learning Objectives
Describe Web mining, its objectives, and its benefits
Understand the three different branches of Web mining
  Web content mining
  Web structure mining
  Web usage mining
Understand the applications of these three mining paradigms
Opening Vignette
Mining Text for Security and Counterterrorism
  What is MITRE?
  Problem description
  Proposed solution
  Results
  Answer & discuss the case questions
What is Text-Mining?
Finding interesting regularities in large textual datasets (adapted from Usama Fayyad)
  where interesting means: non-trivial, hidden, previously unknown and potentially useful
Finding semantic and abstract information from the surface form of textual data
Semi-Structured Data
Text databases are, in general, semi-structured
Example:
  Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
  Unstructured: Abstract, Content
Text Mining Process
Text preprocessing
  Syntactic/semantic text analysis
Features generation
  Bag of words
Features selection
  Simple counting
  Statistics
Text/data mining
  Classification
  Clustering
  Associations
Analyzing results
Levels of text representations
Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Language models
Full-parsing
Cross-modality
Collaborative tagging / Web2.0
Templates / Frames
Ontologies / First order theories
Character level
Character-level representation of a text consists of sequences of characters
  a document is represented by a frequency distribution of sequences
  usually we deal with contiguous strings
  each character sequence of length 1, 2, 3, ... represents a feature with its frequency
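The character-level representation above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `char_ngrams` and the cut-off `max_n=3` are assumptions chosen for the example.

```python
from collections import Counter

def char_ngrams(text, max_n=3):
    """Count every contiguous character sequence of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

# Each n-gram is a feature; its count is the feature value.
features = char_ngrams("banana")
print(features["a"], features["an"], features["ana"])
```

The resulting counter is the frequency distribution the slide describes; in practice the counts are often normalized by document length.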
Word level
The most common representation of text, used by many techniques
  there are many tokenization software packages which split text into words
Important to know:
  A word is a well-defined unit in Western languages
  Other languages differ, e.g. Chinese has a different notion of semantic unit
Words Properties
Relations among word surface forms and their senses:
  Homonymy: same form, but different meaning (e.g. bank: river bank, financial institution)
  Polysemy: same form, related meaning (e.g. bank: blood bank, financial institution)
  Synonymy: different form, same meaning (e.g. singer, vocalist)
  Hyponymy: one word denotes a subclass of another (e.g. breakfast, meal)
Word frequencies in texts follow a power-law distribution:
  a small number of very frequent words
  a large number of low-frequency words
Phrase level
Instead of having just single words we can deal with phrases
We use two types of phrases:
  Phrases as frequent contiguous word sequences
  Phrases as frequent non-contiguous word sequences
  both types of phrases can be identified by a simple dynamic programming algorithm
The main effect of using phrases is to identify sense more precisely
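As a rough illustration of the first phrase type (frequent contiguous word sequences), the sketch below simply counts word n-grams and keeps the frequent ones. It is plain counting, not the dynamic programming algorithm mentioned above, and the names `frequent_phrases`, `n` and `min_count` are assumptions made for the example.

```python
from collections import Counter

def frequent_phrases(docs, n=2, min_count=2):
    """Return contiguous word n-grams that occur at least min_count times."""
    counts = Counter()
    for doc in docs:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

docs = [
    "data mining finds patterns",
    "text mining extends data mining",
    "web mining uses data mining",
]
phrases = frequent_phrases(docs)
print(phrases)
```

Only the bigram ("data", "mining") passes the threshold here, which shows how a phrase feature can pin down a sense that the single word "mining" leaves open.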
Part-of-Speech level
By introducing part-of-speech tags we introduce word types, which makes it possible to differentiate the functions of words
  For text analysis, part-of-speech information is used mainly for information extraction, where we are interested in e.g. named entities, which are noun phrases
  Another possible use is reduction of the vocabulary (features)
    it is known that nouns carry most of the information in text documents
Part-of-speech taggers are usually learned by an HMM algorithm on manually tagged data
Part-of-Speech Table
http://www.englishclub.com/grammar/parts-of-speech_1.htm
WordNet database of lexical relations
WordNet is the most well-developed and widely used lexical database for English
  it consists of 4 databases (nouns, verbs, adjectives, and adverbs)
Each database consists of sense entries
  each sense consists of a set of synonyms, e.g.:
    musician, instrumentalist, player
    person, individual, someone
    life form, organism, being

Category    Unique Forms    Number of Senses
Noun        94474           116317
Verb        10319           22066
Adjective   20170           29881
Adverb      4546            5677
WordNet relations
Each WordNet entry is connected with other entries in the graph through relations
Relations in the database of nouns:

Relation     Definition                       Example
Hypernym     From lower to higher concepts    breakfast -> meal
Hyponym      From concepts to subordinates    meal -> lunch
Has-Member   From groups to their members     faculty -> professor
Member-Of    From members to their groups     copilot -> crew
Has-Part     From wholes to parts             table -> leg
Part-Of      From parts to wholes             course -> meal
Antonym      Opposites                        leader -> follower
Document Representation
A document representation aims to capture what the document is about
One possible approach
  Each entry describes a document
  Each attribute describes whether or not a term appears in the document

             Term
             Camera  Digital  Memory  Pixel
Document 1   1       1        0       1
Document 2   1       1        0       0
-            -       -        -       -
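A small sketch of how such a binary table could be built. The two document texts are made up for illustration (the slide gives only the resulting 0/1 values), and `incidence_matrix` is a hypothetical helper name.

```python
def incidence_matrix(docs, terms):
    """Map each document name to a 0/1 row: does the term appear in it?"""
    rows = {}
    for name, text in docs.items():
        words = set(text.lower().split())
        rows[name] = [1 if term in words else 0 for term in terms]
    return rows

terms = ["camera", "digital", "memory", "pixel"]
# Hypothetical texts chosen to reproduce the rows shown above.
docs = {
    "Document 1": "digital camera with a ten mega pixel sensor",
    "Document 2": "compact digital camera",
}
matrix = incidence_matrix(docs, terms)
for name, row in matrix.items():
    print(name, row)
```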
Distance Based Matching
In order to retrieve documents similar to a given document, one needs a measure of similarity
Euclidean distance
  The Euclidean distance between
  X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as
  $$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
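The Euclidean distance translates directly into code. In this sketch the two vectors are the binary term rows from the document representation example; the function name is an assumption.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# The two documents differ in exactly one term.
d = euclidean_distance([1, 1, 0, 1], [1, 1, 0, 0])
print(d)
```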
Similarity between document vectors
Each document is represented as a vector of weights D = (x_1, ..., x_n)
Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors
  calculates the cosine of the angle between document vectors
  efficient to calculate (sum of products of intersecting words)
  similarity value between 0 (different) and 1 (the same)
$$Sim(D_1, D_2) = \frac{\sum_i x_{1i} x_{2i}}{\sqrt{\sum_j x_{1j}^2} \, \sqrt{\sum_k x_{2k}^2}}$$
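A minimal sketch of cosine similarity over sparse document vectors, assuming each document is stored as a dictionary from term to weight (a representation chosen for the example, not given in the slides). Only terms present in both documents contribute to the numerator, which is the "sum of products of intersecting words".

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared = set(d1) & set(d2)
    dot = sum(d1[t] * d2[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm1 * norm2)

doc_a = {"camera": 1, "digital": 1, "pixel": 1}
doc_b = {"camera": 1, "digital": 1}
sim = cosine_similarity(doc_a, doc_b)
print(round(sim, 4))
```

With non-negative weights the value stays between 0 and 1, as stated above.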
Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure
The quality of a collection can be compared by the two following measures

$$\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}$$

$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$

[Figure: Venn diagram of Relevant Documents and Retrieved Documents; the overlap is Relevant & Retrieved]
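Both measures reduce to set operations. The sketch below assumes documents are identified by string ids; the ids and the function name are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

p, r = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d2", "d3", "d7"},
)
print(p, r)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).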
Cluster Analysis for Data Mining
Analysis methods
  Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
  Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
  Fuzzy logic (e.g., fuzzy c-means algorithm)
  Genetic algorithms
Divisive versus agglomerative methods
Text Mining for Patent Analysis
(see Applications Case 7.2)
What is a patent?
exclusive rights granted by a country to an
inventor for a limited period of time in exchange
for a disclosure of an invention
How do we do patent analysis (PA)?
Why do we need to do PA?
What are the benefits?
What are the challenges?
How does text mining help in PA?
Natural Language Processing (NLP)
Structuring a collection of text
  Old approach: bag-of-words
  New approach: natural language processing
NLP is ...
  a very important concept in text mining
  a subfield of artificial intelligence and computational linguistics
  the study of "understanding" the natural human language
Syntax versus semantics based text mining
Natural Language Processing (NLP)
Challenges in NLP
  Part-of-speech tagging
  Text segmentation
  Word sense disambiguation
  Syntax ambiguity
  Imperfect or irregular input
  Speech acts
Dream of AI community
  to have algorithms that are capable of automatically reading and obtaining knowledge from text
Natural Language Processing (NLP)
WordNet
  A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets
  A major resource for NLP
  Needs automation to be completed
Sentiment Analysis
  A technique used to detect favorable and unfavorable opinions toward specific products and services
  See Application Case 7.3 for a CRM application
NLP Task Categories
Information retrieval
Information extraction
Named-entity recognition
Question answering
Automatic summarization
Natural language generation & understanding
Machine translation
Foreign language reading & writing
Speech recognition
Text proofing
Optical character recognition
Text Mining Applications
Marketing applications
  Enables better CRM
Security applications
  ECHELON, OASIS
  Deception detection (example coming up)
Medicine and biology
  Literature-based gene identification (example coming up)
Academic applications
  Research stream analysis (example coming up)
Text Mining Applications (gene/protein interaction identification)
[Figure: the sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." annotated at four levels: gene/protein and ontology identifiers, word identifiers, POS tags (NN IN NN IN VBZ IN JJ JJ NN NN NN CC NN IN NN), and shallow parse (NP PP NP NP PP NP NP PP NP)]
Text Mining Process
The three-step text mining process
Task 1: Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data
  The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
  The output of Task 1 is a collection of documents in some digitized format for computer processing
Task 2: Create the Term-Document Matrix: Introduce Structure to the Corpus
  The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies
Task 3: Extract Knowledge: Discover Novel Patterns from the T-D Matrix
  The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations
Feedback loops connect the three tasks
Text Mining Process
Step 1: Establish the corpus
  Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings)
  Digitize, standardize the collection (e.g., all in ASCII text files)
  Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)
Text Mining Process
Step 2: Create the Term-by-Document Matrix
[Figure: example term-by-document matrix with documents (Document 1 to Document 6, ...) as rows, terms (investment risk, project management, software engineering, development, SAP, ...) as columns, and term counts in the cells]
Text Mining Process
Step 2: Create the Term-by-Document Matrix (TDM)
Should all terms be included?
  Stop words, include words
  Synonyms, homonyms
  Stemming
What is the best representation of the indices (values in cells)?
  Raw counts; binary frequencies; log frequencies
  Inverse document frequency
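The cell representations listed above can be compared side by side with a small sketch. The weighting formulas are common textbook choices and are assumptions here (log frequency as 1 + ln(tf), inverse document frequency as tf * ln(N / df)); the slides do not fix a formula. The function name and sample documents are also made up.

```python
import math
from collections import Counter

def term_document_matrix(docs, weighting="raw"):
    """Build a documents-by-terms matrix with the chosen cell weighting."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = {t: sum(1 for c in counts if t in c) for t in terms}
    matrix = []
    for c in counts:
        row = []
        for t in terms:
            tf = c[t]
            if weighting == "raw":
                value = tf
            elif weighting == "binary":
                value = 1 if tf > 0 else 0
            elif weighting == "log":
                value = 1 + math.log(tf) if tf > 0 else 0.0
            elif weighting == "tfidf":
                value = tf * math.log(n_docs / doc_freq[t])
            else:
                raise ValueError("unknown weighting: " + weighting)
            row.append(value)
        matrix.append(row)
    return terms, matrix

docs = [
    "project management software",
    "software engineering software",
    "investment risk",
]
terms, raw = term_document_matrix(docs)
_, binary = term_document_matrix(docs, "binary")
_, tfidf = term_document_matrix(docs, "tfidf")
print(terms)
print(raw)
```

Inverse document frequency down-weights terms that occur in most documents, so "software" (two of three documents) scores lower per occurrence than "engineering" (one of three).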
Text Mining Process
Step 2: Create the Term-by-Document Matrix (TDM)
TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
  Manual: a domain expert goes through it
  Eliminate terms with very few occurrences in very few documents (?)
  Transform the matrix using singular value decomposition (SVD)
  SVD is similar to principal component analysis
Text Mining Process
Step 3: Extract patterns/knowledge
  Classification (text categorization)
  Clustering (natural groupings of text)
    Improve search recall
    Improve search precision
    Scatter/gather
    Query-specific clustering
  Association
  Trend analysis
Web Mining
The term was coined by Oren Etzioni (1996)
Application of data mining techniques to automatically discover and extract information from Web data
What is Web Mining?
Discovering useful information from the
World-Wide Web and its usage patterns
Web Mining v. Data Mining
Structure (or lack of it)
Textual information and linkage structure
Scale
Data generated per day is comparable to largest
conventional data warehouses
Speed
Often need to react to evolving usage patterns in real-time (e.g., merchandising)
Web Mining topics
Web graph analysis
Power Laws and The Long Tail
Structured data extraction
Web advertising
Systems Issues
Size of the Web
Number of pages
  Technically, infinite
  Much duplication (30-40%)
  Best estimate of unique static HTML pages comes from search engine claims
    Until last year, Google claimed 8 billion(?), Yahoo claimed 20 billion
    Google recently announced that their index contains 1 trillion pages
    How to explain the discrepancy?
The web as a graph
Pages = nodes, hyperlinks = edges
Ignore content
Directed graph
High linkage
10-20 links/page on average
Power-law degree distribution
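To make the graph view concrete, the sketch below treats hyperlinks as directed (source, target) pairs and tabulates how many pages have each in-degree and out-degree. The tiny edge list is invented for illustration; on a real crawl, plotting these histograms on log-log axes is the usual way to look for a power law.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_hist, out_hist): degree value -> number of pages."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    # Pages with no incoming (or outgoing) links count as degree 0.
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return in_hist, out_hist

edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a"), ("c", "a")]
in_hist, out_hist = degree_distributions(edges)
print(dict(in_hist), dict(out_hist))
```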
Structure of Web graph
Let's take a closer look at structure
  Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
  Bow-tie structure
  Not a "small world"
Bow-tie Structure
Source: Broder et al, 2000
Power-law degree distribution
Source: Broder et al, 2000
Power-laws galore
Structure
In-degrees
Out-degrees
Number of pages per site
Usage patterns
Number of visitors
Popularity e.g., products, movies, music
The Long Tail
Source: Chris Anderson (2004)
Extracting Structured Data
http://www.simplyhired.com
Extracting structured data
http://www.fatlens.com
Ads vs. search results
Search advertising is the revenue model
Multi-billion-dollar industry
Advertisers pay for clicks on their ads
Interesting problems
What ads to show for a search?
If I'm an advertiser, which search terms should I bid on, and how much should I bid?
Two Approaches to Analyzing Data
Machine Learning approach
  Emphasizes sophisticated algorithms, e.g., Support Vector Machines
  Data sets tend to be small, fit in memory
Data Mining approach
  Emphasizes big data sets (e.g., in the terabytes)
  Data cannot even fit on a single disk!
  Necessarily leads to simpler algorithms
Systems architecture
[Figure: a single machine with CPU, Memory, and Disk; labels: Machine Learning, Statistics; Classical Data Mining]
Very Large-Scale Data Mining
[Figure: cluster of commodity nodes, each with its own CPU, memory, and disk]
Systems Issues
Web data sets can be very large
Tens to hundreds of terabytes
Cannot mine on a single server!
Need large farms of servers
How to organize hardware/software to mine multi-terabyte data sets
  Without breaking the bank!
Project
Lots of interesting project ideas
  If you can't think of one, please come discuss with us
Infrastructure
  Aster Data cluster on Amazon EC2
  Supports both MapReduce and SQL
Data
  Netflix
  ShareThis
  Google WebBase
  TREC
Data Mining vs. Web Mining
Traditional data mining
  data is structured and relational
  well-defined tables, columns, rows, keys, and constraints
Web data
  Semi-structured and unstructured
  readily available data
  rich in features and patterns
Web Data
Web Structure
  Link tag with anchor text "Click here to Shop Online" pointing to http://www.walmart.com/
Web Data
Web Usage
Application Server logs
Http logs
Web Data
Web Content
Web Mining Categories
Web Content Mining
  Discovering useful information from web contents/data/documents.
Web Structure Mining
  Discovering the model underlying link structures (topology) on the Web, e.g. discovering authorities and hubs
Web Usage Mining
  Making sense of data generated by surfers
  Usage data from logs, user profiles, user sessions, cookies, user queries, bookmarks, mouse clicks and scrolls, etc.
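One well-known way to discover authorities and hubs from link structure is Kleinberg's HITS iteration; the slides do not name an algorithm, so treating it as the method meant here is an assumption. The sketch below is a bare-bones version on a made-up link list: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dictionaries for a directed link list."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of pages linking in.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: sum of authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return hub, auth

# Hypothetical pages: three pages link to "news", one also links to "blog".
edges = [("p1", "news"), ("p2", "news"), ("p3", "news"), ("p1", "blog")]
hub, auth = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```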
Web Content Data Structure
Unstructured: free text
Semi-structured: HTML
More structured: table or database-generated HTML pages
Multimedia data receive less attention than text or hypertext
Web Content Mining
Process of information or resource discovery from the content of millions of sources across the World Wide Web
  E.g. Web data contents: text, image, audio, video, metadata and hyperlinks
Goes beyond keyword extraction, or some simple statistics of words and phrases in documents.
Web Content Mining
Pre-processing data before web content mining: feature selection (Piramuthu 2003)
Post-processing data can reduce ambiguous search results (Sigletos & Paliouras 2003)
Web Page Content Mining
Mines the contents of documents directly
Search Engine Mining
Improves on the content search of other tools like search
engines.
Web Content Mining
Web content mining is related to data mining and text mining. [Bing Liu. 2005]
  It is related to data mining because many data mining techniques can be applied in Web content mining.
  It is related to text mining because much of the web content is text.
  Web data are mainly semi-structured and/or unstructured, while data mining deals with structured data and text mining with unstructured text.
http://www.cs.uic.edu/~liub
Web Content Mining: IR View
Unstructured Documents
  Bag of words, or phrase-based feature representation
  Features can be boolean or frequency based
  Features can be reduced using different feature selection techniques
  Word stemming, combining morphological variations into one feature
Web Content Mining: IR View
Semi-Structured Documents
  Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  Uses common data mining methods (whereas unstructured might use more text mining methods)
Web Content Mining: DB View
Tries to infer the structure of a Web site or transform a Web site to become a database
  Better information management
  Better querying on the Web
Can be achieved by:
  Finding the schema of Web documents
  Building a Web warehouse
  Building a Web knowledge base
  Building a virtual database
Web-Structure Mining
Generate a structural summary about the Web site and Web page
  Depending upon the hyperlink, categorizing the Web pages and the related information at the inter-domain level
  Discovering the Web page structure.
  Discovering the nature of the hierarchy of hyperlinks in the Web site and its structure.
Web-Structure Mining cont.
Finding information about web pages
Inference on hyperlinks
  Retrieving information about the relevance and the quality of the web page.
  Finding the authoritative pages on the topic and content.
  A web page contains not only information but also hyperlinks, which carry a huge amount of annotation.
  A hyperlink identifies the author's endorsement of the other web page.
Web-Structure Mining cont.
More Information on Web Structure Mining
  Web page categorization (Chakrabarti 1998)
  Finding micro-communities on the web
    e.g. Google (Brin and Page, 1998)
  Schema discovery in semi-structured environments
Web Usage Mining
Tries to predict user behavior from interaction with the Web
Wide range of data (logs)
  Web client data
  Proxy server data
  Web server data
Two common approaches
  Map usage data into relational tables before using adapted data mining techniques
  Use log data directly by utilizing special pre-processing techniques
Web Usage Mining
Typical problems: distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers
Often usage mining uses some background or domain knowledge
  E.g. site topology, Web content, etc.
Web Usage Mining
Two main categories:
  Learning a user profile (personalized)
    Web users would be interested in techniques that learn their needs and preferences automatically
  Learning user navigation patterns (impersonalized)
    Information providers would be interested in techniques that improve the effectiveness of their Web site or bias the users towards the goals of the site
Web Usage Mining cont.
Data Mining Techniques: Navigation Patterns
Analysis example:
  70% of users who accessed /company/product2 did so by starting at /company and proceeding through /company/new, /company/products and /company/product1
  80% of users who accessed the site started from /company/products
  65% of users left the site after four or fewer page references
Web Usage Mining cont.
Data Mining Techniques: Sequential Patterns
Example: Supermarket

Customer   Transaction Time    Purchased Items
John       6/21/05 5:30 pm     Beer
John       6/22/05 10:20 pm    Brandy
Frank      6/20/05 10:15 am    Juice, Coke
Frank      6/20/05 11:50 am    Beer
Frank      6/20/05 12:50 pm    Wine, Cider
Mary       6/20/05 2:30 pm     Beer
Mary       6/21/05 6:17 pm     Wine, Cider
Mary       6/22/05 5:05 pm     Brandy
Web Usage Mining cont.
Data Mining Techniques: Sequential Patterns
Example: Supermarket (cont.)

Customer   Customer Sequence
John       (Beer) (Brandy)
Frank      (Juice, Coke) (Beer) (Wine, Cider)
Mary       (Beer) (Wine, Cider) (Brandy)

Mining Result
Sequential Patterns with Support >= 40%   Supporting Customers
(Beer) (Brandy)                           John, Mary
(Beer) (Wine, Cider)                      Frank, Mary
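The supporting customers in the mining result can be checked with a short sketch. It only tests given candidate patterns against the customer sequences from the example; it is not a full sequential pattern mining algorithm, and the function names are assumptions.

```python
def supports(sequence, pattern):
    """True if the pattern's itemsets occur, in order, within the sequence.

    Each pattern itemset must be a subset of some later transaction.
    """
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def supporting_customers(sequences, pattern):
    return sorted(c for c, seq in sequences.items() if supports(seq, pattern))

sequences = {
    "John": [{"Beer"}, {"Brandy"}],
    "Frank": [{"Juice", "Coke"}, {"Beer"}, {"Wine", "Cider"}],
    "Mary": [{"Beer"}, {"Wine", "Cider"}, {"Brandy"}],
}
beer_brandy = supporting_customers(sequences, [{"Beer"}, {"Brandy"}])
beer_wine = supporting_customers(sequences, [{"Beer"}, {"Wine", "Cider"}])
print(beer_brandy, beer_wine)
```

Each pattern is supported by two of the three customers, about 67%, which clears the 40% threshold.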
Web Usage Mining cont.
Data Mining Techniques: Sequential Patterns
Web usage examples
  In Google search, within the past week 30% of users who visited /company/product/ had "camera" as the search text.
  60% of users who placed an online order in /company/product1 also placed an order in /company/product4 within 15 days
Tech for Web Content Mining
Classifications
Clustering
Association
Document Classification
Supervised Learning
  Supervised learning is a machine learning technique for creating a function from training data.
  Documents are categorized
  The output can predict a class label of the input object (called classification).
Techniques used are
  Nearest Neighbor Classifier
  Feature Selection
  Decision Tree
Feature Selection
Removes terms in the training documents which are statistically uncorrelated with the class labels
Simple heuristics
  Stop words like "a", "an", "the", etc.
  Empirically chosen thresholds for ignoring "too frequent" or "too rare" terms
  Discard "too frequent" and "too rare" terms
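The simple heuristics can be sketched as a filter on document frequency. This covers only the stop-word and threshold heuristics, not a statistical test against class labels; the threshold values, the stop-word set, and the sample documents are assumptions for illustration.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words, not too rare, and not too frequent."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    n_docs = len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS
        and df >= min_df                    # drop "too rare" terms
        and df / n_docs <= max_df_ratio     # drop "too frequent" terms
    )

docs = [
    "the camera has a zoom lens",
    "the camera has a flash",
    "the phone has a camera",
    "the phone has a battery",
]
selected = select_features(docs)
print(selected)
```

Here "has" is dropped as too frequent (it occurs in every document) and "zoom", "lens", "flash" and "battery" as too rare.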
Document Clustering
Unsupervised Learning: a data set of input objects is gathered
Goal: evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
Hypothesis: given a "suitable" clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
Hierarchical
  Bottom-Up
  Top-Down
Partitional
Semi-Supervised Learning
A collection of documents is available
A subset of the collection has known labels
Goal: to label the rest of the collection.
Approach
  Train a supervised learner using the labeled subset.
  Apply the trained learner on the remaining documents.
Idea
  Harness information in the labeled subset to enable better learning.
  Also, check the collection for emergence of new topics
Association
Example: Supermarket

Transaction ID   Items Purchased
1                butter, bread, milk
2                bread, milk, beer, egg
3                diaper

An association rule can be
  "If a customer buys milk, in 50% of cases, he/she also buys beer. This happens in 33% of all transactions."
  50%: confidence
  33%: support
Can also Integrate in Hyperlinks
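The confidence and support figures for the milk and beer rule can be reproduced from the three transactions with a short sketch; the function name `rule_metrics` is an assumption.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

transactions = [
    {"butter", "bread", "milk"},
    {"bread", "milk", "beer", "egg"},
    {"diaper"},
]
support, confidence = rule_metrics(transactions, {"milk"}, {"beer"})
print(round(support, 2), confidence)
```

Milk appears in two transactions and beer accompanies it in one of them, giving 50% confidence; that one transaction out of three gives 33% support.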
Q & A