Introduction
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Mining -- Introduction 2
• Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
• Product information (product-id, category, manufacturer, made-in, stock-price, …)
• Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
• Business queries
Techniques: Business Intelligence
• Multidimensional data analysis
• Online query answering
• Interactive data exploration
Motivation: Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Techniques: Store Layout Design
• Customer purchase patterns
• Business strategies
Motivation: Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Techniques: Community Detection
• Similarity between objects
• Partitioning objects into groups
  – No guidance about what a group is
Motivation: Disease Prediction
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …
What medical problems does this patient have?
Techniques: Disease Prediction
• Features
• Model
Motivation: Fraud Detection
http://i.imgur.com/ckkoAOp.gif
Techniques: Fraud Detection
• Features
• Dissimilarity
• Groups and noise
http://i.stack.imgur.com/tRDGU.png
What Is Data Science About?
• Data
• Extraction of knowledge from data
• Continuation of data mining and knowledge discovery from data (KDD)
What Is Data?
• Values of qualitative or quantitative variables belonging to a set of items
• Represented in a structure, e.g., a tabular, tree, or graph structure
• Typically the results of measurements
• As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived
What Is Information?
• “Knowledge communicated or received concerning a particular fact or circumstance”
• Conceptually, information is the message (utterance or expression) being conveyed
• Cannot be predicted
• Can resolve uncertainty
What Is Knowledge?
• Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
• Implicit knowledge: practical skill or expertise
• Explicit knowledge: theoretical understanding of a subject
Data Systems
• A data system answers queries based on data acquired in the past
• Base data – the rawest data, not derived from anywhere else
• Knowledge – information derived from the base data
Dealing with Data – Querying
• Given a set of student records about name, age, courses taken, and grades
• Simple queries
  – What is John Doe’s age?
• Aggregate queries
  – What is the average GPA of all students at this school?
• Queries can be arbitrarily complicated
  – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
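The simple and aggregate query types above can be sketched with Python’s built-in sqlite3 module; the student table and its values here are invented purely for illustration, not real course data:

```python
import sqlite3

# Hypothetical student records, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [("John Doe", 20, 3.2), ("Jane Roe", 22, 3.8)])

# Simple query: one fact about one individual
age = conn.execute(
    "SELECT age FROM students WHERE name = ?", ("John Doe",)).fetchone()[0]

# Aggregate query: one fact about the whole group
avg_gpa = conn.execute("SELECT AVG(gpa) FROM students").fetchone()[0]

print(age, avg_gpa)  # 20 3.5
```

The “grades less than 3% apart” query would additionally need a self-join over per-course grades, which is one way queries become arbitrarily complicated.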
Queries
• A precise request for information
• Subjects in databases and information retrieval
  – Databases: structured queries on structured (e.g., relational) data
  – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
• Important assumptions
  – Information needs
  – Query languages
Data-driven Exploration
• What should be the next strategy of a company?
  – A lot of data: sales, human resources, production, tax, service cost, …
• The question cannot be translated into a precise request for information (i.e., a query)
• Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data
Data-driven Thinking
• Starting with some simple queries
• New queries are raised by consuming the results of previous queries
• No ultimate query in design!
  – But many queries can be answered using DB/IR techniques
The Art of Data-driven Thinking
• The way of generating queries remains an art!
  – Different people may derive different results using the same data

“If you torture the data long enough, it will confess.” – Ronald H. Coase

• More often than not, more data may be needed – datafication
Queries for Data-driven Thinking
• Probe queries – finding information about specific individuals
• Aggregation – finding information about groups
• Pattern finding – finding commonality in a population
• Association and correlation – finding connections among individuals and groups
• Causality analysis – finding causes and consequences
What Is Data Mining?
• Broader sense: the art of data-driven thinking
• Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
  – Methods and tools for answering various types of queries in the data mining process in the broader sense
Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell

• Essentially, learn the distribution of the data
Data mining vs. Machine Learning
• Machine learning focuses on prediction, based on known properties learned from the training data
• Data mining focuses on the discovery of (previously) unknown properties in the data
The KDD Process
Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
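The five stages can be sketched as a chain of functions feeding each other; the tiny stage bodies below are placeholders to show the shape of the process, not the actual methods covered later in the course:

```python
# A minimal sketch of the KDD process as function composition.
# Each stage's implementation is a stand-in, chosen only to be runnable.

def selection(data):            # pick the target data relevant to the task
    return [r for r in data if r is not None]

def preprocessing(target):      # clean: drop duplicates, impose an order
    return sorted(set(target))

def transformation(clean):      # re-encode into a mining-ready form
    return [x * x for x in clean]

def data_mining(transformed):   # extract "patterns" (here: values above a threshold)
    return [x for x in transformed if x > 10]

def interpretation(patterns):   # evaluate patterns into knowledge
    return {"interesting": patterns}

data = [3, 1, None, 4, 1, 5]
knowledge = interpretation(data_mining(transformation(preprocessing(selection(data)))))
print(knowledge)  # {'interesting': [16, 25]}
```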
Data Mining R&D
• New problem identification
• Data collection and transformation
• Algorithm design and implementation
• Evaluation
  – Effectiveness evaluation
  – Efficiency & scalability evaluation
• Deployment and business solution
Data Mining on Big Data
“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”
– Hal Varian, Google’s Chief Economist
What Is Big Data?
• No quantitative definition!
• “Big data is like teenage sex:
  – everyone talks about it,
  – nobody really knows how to do it,
  – everyone thinks everyone else is doing it,
  – so everyone claims they are doing it...” – Dan Ariely
Data Volume vs. Storage Cost
• The unit cost of disk storage decreases dramatically
| Year | Unit cost |
| --- | --- |
| 1956 | $10,000/MB |
| 1980 | $193/MB |
| 1990 | $9/MB |
| 2000 | $6.9/GB |
| 2010 | $0.08/GB |
| 2013 | $0.06/GB |
http://ns1758.ca/winch/winchest.html
Big Data – Volume
“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”
— Wikipedia
Big Data: Volume
• Every day, about 7 billion shares change hands on US equity markets
  – About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risks
• In Q2 2015
  – Facebook had 1.49 billion active users
  – WeChat had 600 million active users, 100 million outside China
  – LinkedIn had 380 million active users
  – Twitter had 304 million active users
Velocity
• Google processes 24+ petabytes of data per day
• Facebook gets 10+ million new photos uploaded every hour
• Facebook members like or leave a comment 3+ billion times per day
• YouTube users upload 1+ hour of video every second
• 400+ million tweets per day
What Has Been Changed?
• The 1880 US census took 8 years to complete
  – The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
• It is essential to get not only accurate but also timely data
  – Statisticians use sampling to estimate
• Recently, new technologies have fundamentally changed the ways data is collected and transmitted
Sampling for Volume/Velocity?
• Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
  – The sample should be truly random
• On a data set of hundreds or thousands of attributes, can sampling help in
  – Finding subcategories of attribute combinations
  – Finding outliers and exceptions
• Big data contains signals of different strengths
  – Not noise, but weaker and weaker signals that may still be interesting and important
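The shrinking-marginal-information idea can be seen numerically: the spread of a sample mean falls roughly as 1/√n, so a 100× larger sample buys only about a 10× tighter estimate. A sketch on synthetic data (the population size and its mean/standard deviation are invented for the demonstration):

```python
import random

random.seed(0)
# Synthetic population: 100,000 values around mean 50, std 10 (invented parameters)
population = [random.gauss(50, 10) for _ in range(100_000)]

def sample_mean(n):
    """Mean of one truly random sample of size n."""
    return sum(random.sample(population, n)) / n

# Empirical spread of the estimator at two sample sizes
spreads = {}
for n in (100, 10_000):
    estimates = [sample_mean(n) for _ in range(50)]
    center = sum(estimates) / len(estimates)
    spreads[n] = (sum((e - center) ** 2 for e in estimates) / len(estimates)) ** 0.5

print(spreads)  # spread at n=10,000 is roughly 10x smaller than at n=100
```

Note that this speaks only to estimating a global statistic; it says nothing about finding rare subcategories or outliers, which is exactly where sampling struggles.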
Big Data – Lytro Pictures
• Lytro pictures record the whole light field
  – Photographers can decide later which parts to focus on
• Big data tries to record as much information as possible
  – Analysts can decide later what to extract from big data
  – Both advantages and challenges
Veracity
• “1 in 3 business leaders don't trust the information they use to make decisions”
• Assuming a slowly growing total cost budget, tradeoff between data volume and data quality
• Loss of veracity in combining different types of information from different sources
• Loss of veracity in data extraction, transformation, and processing
Variety
• Integrating data capturing different aspects of a data object
  – Vancouver Canucks: game video, technical statistics, social media, …
  – Different pieces are in different formats
• Different views of the same data object from different sources
  – Did the soccer ball pass the goal line?
  – The views may not be consistent
Four V-challenges
• Volume: massive scale and growth, 40% per year in global data generated
• Velocity: real-time data generation and consumption
• Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources
• Veracity
Is Big Data Really New?
• People were aware of the existence of big data long ago, but no one could access it until very recently
  – (Genesis 28:15) “I am with you and will watch over you wherever you go”
  – “Whispers in a secret room, Heaven hears as thunder; a deceitful heart in a dark chamber, the gods see as lightning; retribution for good and evil follows like a shadow” (Chinese maxim)
  – Similar statements in the Quran and sutras
• What has changed?
  – How data is connected with people
Diversity in Data Usage
• In the past, only very few projects could afford to be data-intensive
• Nowadays, a great many applications are (naturally) data-intensive
Datafication
• Extracting data about an object or event in a quantified way so that it can be analyzed
  – Different from digitalization
• An important feature of big data
• Key: new data, new applications, new opportunities
New Values of Datafication
• Example: CAPTCHA and reCAPTCHA (Luis von Ahn)
• How to create new value from data and datafication?
  – Connecting data with new users
  – Connecting different pieces of data to present a bigger picture
• Important techniques
  – Data aggregation
  – Extended datafication
Big Data Players
• Data holders
• Data specialists
• Big-data mindset leaders
• A capable company may play two or three roles at the same time
• What is most important: the big-data mindset, skills, or the data itself?
Privacy
• “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”
— Executive Office of the (US) President
Keep in Mind
“Our industry does not respect tradition – it only respects innovation.”
– Satya Nadella
Goals of This Course
• Data-driven thinking – towards being a (big) data scientist
• Principles and hands-on skills of data mining, particularly in the context of big data
  – Identifying new data mining problems
  – Data mining algorithm design
  – Data mining applications
• Novel problems for upcoming research
Format
• Due to the fast progress in data mining, we will go substantially beyond the textbook
• Active classroom discussion
• Open questions and brainstorming
• Textbook: Data Mining – Concepts and Techniques (3rd ed.)
Read – Try – Think
• Reading
  – (Required) Textbook and a small number of research papers
  – You must have the 3rd ed. of the textbook!
  – (Open-ended; not covered by the exam) Technical and non-technical materials
• Trying
  – Assignments and a project
• Thinking
  – Examine everything from a data scientist’s angle, starting today
Data Mining: History
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991–94 Workshops on Knowledge Discovery in Databases
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
Data Mining: History (cont’d)
• 1995–98 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95–98)
  – Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
  – PAKDD (1997), PKDD (1997), SIAM Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007
Frequent Pattern Mining
How Many Words Is a Picture Worth?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 53
E. Aiden and J.-B. Michel: Uncharted. Riverhead Books, 2013
Burnt or Burned?
Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
• Alphabet: a set of items
  – Example: all products sold in a store
• A transaction: a set of items involved in an activity
  – Example: the items purchased by a customer in a visit
• Other information is often associated with a transaction
  – Timestamp, price, salesperson, customer-id, store-id, …
Examples of Transaction Data
How to Store Transaction Data?
• Example transactions: (t123, a, b, c), (t236, b, d)
• Relational storage: one (Tid, Item) row per item
    Tid   Item
    t123  a
    t123  b
    t123  c
    …     …
    t236  b
    t236  d
• Transaction-based storage: one record per transaction
• Item-based (vertical) storage: one Tid-list per item
  – Item a: …, t123, …
  – Item b: …, t123, …, t236, …
  – …
Transaction Data Analysis
• Transactions: customers' purchases of commodities
  – e.g., {bread, milk, cheese} if they are bought together in one visit
• Frequent patterns: product combinations that are frequently purchased together by customers
• More generally, frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What key-word combinations are frequently associated with web pages about game-evaluation?
Why Frequent Pattern Mining?
• Foundation for many data mining tasks
  – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications
  – Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click-stream) analysis, …
Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of an itemset: the number of transactions containing it
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB:
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory?
  – 100 items → 2^100 − 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say one containing 100 million transactions?
  – A transaction of length 20 requires updating the supports of 2^20 − 1 = 1,048,575 itemsets
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items
  – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent ⇒ {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
The Apriori Algorithm [AgSr94]
Database D (Min_sup = 2):
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates and counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Scan D → counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D → counts: bce:2
Frequent 3-itemsets: bce:2
The Apriori Algorithm
Level-wise, candidate generation and test
• Ck: candidate itemsets of size k
• Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk ≠ ∅; k++) do
    Ck+1 = candidates generated from Lk;        // candidate generation
    for each transaction t in database do       // test
      increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
  return ∪k Lk;
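The pseudocode above can be rendered as a short Python sketch (a minimal illustration, not the optimized algorithm: the ordered self-join and the hash-tree counting are simplified to plain set operations, and the function name is only for this example):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise candidate generation and test. Candidates of size k+1
    are formed by unioning pairs of frequent k-itemsets, then pruned
    with the anti-monotone property before a counting scan."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # L1: frequent 1-itemsets
    L = {frozenset([i]): c for i, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 1
    while L:
        # candidate generation: join Lk with itself, prune by subsets
        C = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in L
                                           for s in combinations(u, k)):
                    C.add(u)
        # test: count each candidate's support in one database scan
        L = {}
        for c in C:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_sup:
                L[c] = sup
        result.update(L)
        k += 1
    return result
```

On the toy database D above (min_sup = 2) this yields exactly the nine frequent itemsets a, b, c, e, ac, bc, be, ce, and bce.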
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
• With n items, for items i, j (i < j): c[i, j] = c[(i−1)(2n−i)/2 + (j−i)]
• Example (n = 5): c[3, 5] = c[(3−1)(2·5−3)/2 + (5−3)] = c[9]

      j=2  j=3  j=4  j=5
i=1    1    2    3    4
i=2         5    6    7
i=3              8    9
i=4                  10
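The indexing rule can be checked in a few lines of Python (a sketch; `tri_index` and `count_pairs` are illustrative names, not from the slides):

```python
from itertools import combinations

def tri_index(i, j, n):
    """1-based flat position of the pair (i, j), i < j, among n items,
    following c[i, j] = c[(i-1)(2n-i)/2 + (j-i)]."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

def count_pairs(transactions, n):
    """Count all 2-itemsets in one scan into a flat triangle array.
    Items are assumed to be numbered 1..n."""
    counts = [0] * (n * (n - 1) // 2)
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            counts[tri_index(i, j, n) - 1] += 1
    return counts
```

For example, tri_index(3, 5, 5) is 9, matching c[3, 5] = c[9] above.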
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 ∗ L3
  – abcd ← abc ∗ abd
  – acde ← acd ∗ ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
How to Generate Candidates?
• Suppose the items in Lk−1 are listed in an order
• Step 1: self-join Lk−1
    INSERT INTO Ck
    SELECT p.item1, p.item2, …, p.itemk−1, q.itemk−1
    FROM Lk−1 p, Lk−1 q
    WHERE p.item1 = q.item1, …, p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1
• Step 2: pruning
  – For each itemset c in Ck do
    • For each (k−1)-subset s of c do
        if (s is not in Lk−1) then delete c from Ck
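The same join-and-prune step in Python, with itemsets represented as sorted tuples (a sketch mirroring the SQL; `gen_candidates` is an illustrative name):

```python
from itertools import combinations

def gen_candidates(L_prev):
    """Self-join frequent (k-1)-itemsets that share their first k-2
    items, then prune candidates with an infrequent (k-1)-subset."""
    L_prev = set(L_prev)
    k = len(next(iter(L_prev))) + 1
    C = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must be frequent
                if all(s in L_prev for s in combinations(c, k - 1)):
                    C.add(c)
    return C
```

On the L3 from the previous example this joins to abcd and acde, prunes acde (ade is not in L3), and returns C4 = {abcd}.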
How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and their counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports
[Figure: counting supports with a hash-tree. The subset function hashes items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches of each interior node; leaves hold candidate 3-itemsets such as 1 4 5, 1 2 4, 3 5 6, and 6 8 9. Transaction 1 2 3 5 6 is matched by recursively splitting it: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, …]
Association Rules
• Rule: c → am
• Support: 3 (i.e., the support of acm)
• Confidence: 75% (i.e., sup(acm) / sup(c))
• Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds

Transaction database TDB:
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
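Given the frequent itemsets and their supports, rule generation is a post-processing step. A Python sketch (`gen_rules` is an illustrative name; by anti-monotonicity every subset of a frequent itemset is frequent, so all left-hand sides are assumed present in the support map):

```python
from itertools import combinations

def gen_rules(freq, min_conf):
    """Emit rules X -> Y with conf = sup(X ∪ Y) / sup(X) >= min_conf.
    freq maps frozenset -> support count for all frequent itemsets."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = sup / freq[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules
```

With the supports taken from TDB (sup(c) = 4, sup(acm) = 3), this produces the rule c → am with confidence 0.75, as in the example above.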
Challenges of Freq Pat Mining
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
Improving Apriori: Ideas
• Reducing the number of transaction database scans
• Shrinking the number of candidates
• Facilitating support counting of candidates
Bottleneck of Freq Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many scans and generates many candidates
  – To find frequent itemset i1i2…i100
    • # of scans: 100
    • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
  – Bottleneck: candidate generation and test
• Can we avoid candidate generation?
Search Space of Freq. Pat. Mining
• Itemsets form a lattice

Itemset lattice:
{}
A   B   C   D
AB  AC  AD  BC  BD  CD
ABC  ABD  ACD  BCD
ABCD
Set Enumeration Tree
• Use an order on items, and enumerate itemsets in lexicographic order
  – a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
• Reduce the lattice to a tree

Set enumeration tree:
∅
a   b   c   d
ab  ac  ad  bc  bd  cd
abc  abd  acd  bcd
abcd
Borders of Frequent Itemsets
• Frequent itemsets are connected
  – ∅ is trivially frequent
  – X on the border ⇒ every subset of X is frequent
[Illustrated on the itemset lattice: ∅ / a b c d / ab ac ad bc bd cd / abc abd acd bcd / abcd]
Projected Databases
• To test whether Xy is frequent, we can use the X-projected database
  – The sub-database of the transactions containing X
  – Check whether item y is frequent in the X-projected database
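A minimal sketch of the projection step (illustrative name; transactions as sets):

```python
def projected_db(transactions, X):
    """The X-projected database: the transactions containing X, with X
    itself removed. Item y is frequent here exactly when X ∪ {y} is
    frequent in the original database."""
    X = set(X)
    return [set(t) - X for t in transactions if X <= set(t)]
```

For instance, projecting the 5-transaction ABCD database onto {A, B} keeps the three transactions containing both A and B; D occurs in two of them, so ABD is frequent at min_sup = 2.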
Compress Database by FP-tree
• The 1st scan: find frequent items
  – Only record frequent items in the FP-tree
  – F-list: f-c-a-b-m-p
• The 2nd scan: construct the tree
  – Order the frequent items in each transaction w.r.t. the f-list
  – Explore sharing among transactions

TID  Items bought              (ordered) frequent items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p

[Figure: the resulting FP-tree. From the root, a branch f:4 – c:3 – a:3 – m:2 – p:2, with side branches b:1 under f, b:1 – m:1 under a, and a second branch c:1 – b:1 – p:1; a header table links each item f, c, a, b, m, p to its node occurrences.]
Benefits of FP-tree
• Completeness
  – Never breaks a long pattern in any transaction
  – Preserves complete information for frequent-pattern mining
    • No need to scan the database again
• Compactness
  – Reduces irrelevant info: infrequent items are removed
  – Items in frequency-descending order (f-list): the more frequently an item occurs, the more likely it is to be shared
  – Never larger than the original database (not counting node-links and the count fields)
Partitioning Frequent Patterns
• Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m but no p
  – …
  – Patterns having c but no a, b, m, or p
  – Pattern f
• Depth-first search of a set enumeration tree
  – The partitioning is complete and has no overlap
Find Patterns Having Item “p”
• Only transactions containing p are needed
• Form the p-projected database TDB|p
  – Start at entry p of the header table
  – Follow the side-link of frequent item p
  – Accumulate all transformed prefix paths of p
• p-projected database TDB|p: fcam:2, cb:1
• Local frequent item: c:3
• Frequent patterns containing p: p:3, pc:3
Find Patterns Having Item m But No p
• Form the m-projected database TDB|m
  – Item p is excluded (patterns containing both m and p are already covered by the p-partition)
  – TDB|m contains fca:2 and fcab:1
  – Local frequent items: f, c, a
• Build the FP-tree for TDB|m
[Figure: the m-projected FP-tree, a single branch f:3 – c:3 – a:3 with header table f, c, a]
Recursive Mining
• Patterns having m but no p can be mined recursively
• Optimization: enumerate patterns directly from a single-branch FP-tree
  – Enumerate all combinations of the items on the branch
  – Support = that of the last (deepest) item in the combination
    • m, fm, cm, am
    • fcm, fam, cam
    • fcam
[Figure: the m-projected FP-tree, a single branch f:3 – c:3 – a:3]
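The single-branch optimization can be sketched in Python (illustrative names). Each combination of branch items, appended to the current suffix, is a pattern whose support is the count of its deepest item:

```python
from itertools import combinations

def enum_single_branch(branch, suffix, suffix_sup):
    """branch: (item, count) pairs from root to leaf of a single-branch
    FP-tree; returns every pattern ending in `suffix` with its support."""
    patterns = {suffix: suffix_sup}
    for r in range(1, len(branch) + 1):
        for idx in combinations(range(len(branch)), r):
            items = tuple(branch[i][0] for i in idx) + suffix
            patterns[items] = branch[idx[-1]][1]  # deepest item's count
    return patterns
```

On the branch f:3 – c:3 – a:3 with suffix m this enumerates the eight patterns m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.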
Enumerate Patterns from a Single Prefix of an FP-tree
• A (projected) FP-tree may have a single prefix path
  – Reduce the single prefix into one node
  – Mine the prefix part and the branching part separately, then join the results
[Figure: a tree r whose root starts with the single prefix path a1:n1 – a2:n2 – a3:n3 and then branches into b1:m1 and c1:k1 (with subtrees c2:k2, c3:k3); r is split into the prefix path and the multi-branch part r1]
FP-growth
• Pattern growth: recursively grow frequent patterns by partitioning both the patterns and the database
• Algorithm
  – For each frequent item, construct its projected database, and then its projected FP-tree
  – Repeat the process on each newly created projected FP-tree
  – Until the resulting FP-tree is empty or contains only one path (a single path generates all item combinations, each of which is a frequent pattern)
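The divide-and-conquer structure of FP-growth can be sketched without the tree itself, using plain projected databases (so this shows the pattern/database partitioning but not the FP-tree compression; the function name is illustrative):

```python
def pattern_growth(db, min_sup, suffix=()):
    """Recursively grow patterns: for each frequent item, output it with
    the current suffix, then mine its projected database. Restricting a
    projection to items smaller than the chosen item guarantees that
    every pattern is found exactly once."""
    counts = {}
    for t in db:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    result = {}
    for item in sorted(counts):
        sup = counts[item]
        if sup < min_sup:
            continue
        pattern = (item,) + suffix
        result[pattern] = sup
        # item-projected database: transactions containing `item`,
        # restricted to items that precede it in the chosen order
        proj = [[i for i in t if i < item] for t in db if item in t]
        result.update(pattern_growth(proj, min_sup, pattern))
    return result
```

On the four-transaction database used in the Apriori example (min_sup = 2) this finds the same nine frequent itemsets, with no candidate generation.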
Scaling up by DB Projection
• What if an FP-tree cannot fit into memory?
• Database projection
  – Partition the database into a set of projected databases
  – Construct and mine an FP-tree once a projected database fits into main memory
• Heuristic: projected databases shrink quickly in many applications
Parallel vs. Partition Projection
• Parallel projection: form all projected databases at once
• Partition projection: propagate projections level by level

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb, …
  a-proj DB: fc, …
  c-proj DB: f, …
  f-proj DB: …
    am-proj DB: fc, fc, fc
    cm-proj DB: f, f, f
    …
Why Is FP-growth Efficient?
• Divide-and-conquer strategy
  – Decomposes both the mining task and the DB
  – Leads to focused search of smaller databases
• Other factors
  – No candidate generation, no candidate test
  – Database compression using the FP-tree
  – No repeated scans of the entire database
  – Basic operations are counting local frequent items and building FP-trees; no pattern search or pattern matching
Major Costs in FP-growth
• Poor locality of FP-trees
  – Low cache hit rate
• Building FP-trees
  – A stack of FP-trees during the recursion
• Redundant information
  – Transaction abcd appears in the a-, ab-, abc-, ac-, …, c-projected databases and FP-trees
Effectiveness of Freq Pat Mining
• Too many patterns!
  – A pattern a1a2…an contains 2^n − 1 subpatterns
  – Understanding many patterns is difficult or even impossible for human users
• Non-focused mining
  – A manager may only be interested in patterns involving the items (s)he manages
  – A user is often interested in patterns satisfying some constraints
Itemset Lattice

Min_sup = 2

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

[Lattice: {} / A B C D / AB AC AD BC BD CD / ABC ABD ACD BCD / ABCD]
Max-Patterns

Min_sup = 2 (same transaction database as above)

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

[Lattice with the max-patterns highlighted]
Borders and Max-patterns
• Max-patterns: the border of the frequent patterns
  – Any subset of a max-pattern is frequent
  – Any proper superset of a max-pattern is infrequent
  – Max-patterns alone cannot generate rules: the support counts of their subsets are not retained
[Lattice: {} / A B C D / AB AC AD BC BD CD / ABC ABD ACD BCD / ABCD]
Patterns and Support Counts

Min_sup = 2

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Length  Frequent itemsets
1       A:4, B:3, C:3, D:4
2       AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
3       ABC:2, ABD:2
Frequent Closed Patterns
• For a frequent itemset X: if there exists no item y ∉ X such that every transaction containing X also contains y, then X is a frequent closed pattern
  – “acdf” is a frequent closed pattern here
• Concise representation of frequent patterns
  – Can generate non-redundant rules
• Reduces the number of patterns and rules
• N. Pasquier et al., ICDT’99

Min_sup = 2

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Closed and Max-patterns
• Closed-pattern mining algorithms can be adapted to mine max-patterns
  – A max-pattern must be closed
• Depth-first search methods have advantages over breadth-first search ones
  – Why?
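Both notions can be checked directly against the full set of frequent itemsets (a brute-force sketch for small examples, not a mining algorithm; the supports in the usage example are recomputed from the 5-transaction ABCD database used earlier):

```python
def closed_and_maximal(freq):
    """freq maps frozenset -> support. Closed: no proper superset has
    the same support (an equal-support superset would itself be
    frequent, so it suffices to look inside freq). Maximal: no proper
    frequent superset at all."""
    closed, maximal = {}, {}
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] < sup for Y in supersets):
            closed[X] = sup
        if not supersets:
            maximal[X] = sup
    return closed, maximal
```

For that database, B is frequent but not closed (every transaction containing B also contains A), and the max-patterns are ABC, ABD, and CD; every max-pattern is also closed.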
Constraint-based Data Mining
• Should we find all the patterns in a database autonomously?
  – The patterns could be too many and not focused!
• Data mining should be interactive
  – The user directs what is to be mined
• Constraint-based mining
  – User flexibility: the user provides constraints on what to mine
  – System optimization: push the constraints deep for efficient mining
Constraints in Data Mining
• Knowledge type constraint – classification, association, etc.
• Data constraint — using SQL-like queries – find product pairs sold together in stores in New York
• Dimension/level constraint – in relevance to region, price, brand, customer category
• Rule (or pattern) constraint – small sales (price < $10) triggers big sales (sum >$200)
• Interestingness constraint – strong rules: support and confidence
Constrained Mining vs. Search
• Constrained mining vs. constraint-based search
 – Both aim at reducing the search space
 – Finding all patterns vs. finding some (or one) answer satisfying the constraints
 – Constraint-pushing vs. heuristic search
 – Integrating both is an interesting research problem
• Constrained mining vs. DBMS query processing
 – Database query processing requires finding all answers
 – Constrained pattern mining shares a similar philosophy to pushing selections deep into query processing
Optimization
• Mining frequent patterns with constraint C
 – Sound: find only patterns satisfying constraint C
 – Complete: find all patterns satisfying constraint C
• A naïve solution
 – Test the constraint as a post-processing step
• More efficient approaches
 – Analyze the properties of the constraints
 – Push constraints as deeply as possible into frequent pattern mining
Anti-Monotonicity
• Anti-monotonicity
 – If an itemset S violates the constraint, so does every superset of S
 – sum(S.Price) ≤ v is anti-monotone
 – sum(S.Price) ≥ v is not anti-monotone
• Example
 – C: range(S.profit) ≤ 15
 – Itemset ab violates C (range = 40 > 15)
 – So does every superset of ab
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
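The anti-monotonicity claim can be sanity-checked on the slide's profit table. A small self-checking sketch (the item names and profits are from the slide; adding items can only widen a range, so a violation of range(S.profit) ≤ 15 can never be repaired):

```python
from itertools import combinations

# Item profits from the slide's table
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def rng(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

# C: range(S.profit) <= 15 is anti-monotone: once ab violates C
# (range = 40 - 0 = 40), every superset of ab violates C too.
assert rng({"a", "b"}) == 40                      # ab violates C
for k in range(1, 4):
    for extra in combinations(set(profit) - {"a", "b"}, k):
        assert rng({"a", "b"} | set(extra)) > 15  # so does every superset
print("checked all supersets of ab up to size 5")
```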
Anti-monotonic Constraints

Constraint                      Anti-monotone
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
min(S) ≤ v                      no
min(S) ≥ v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      no
count(S) ≤ v                    yes
count(S) ≥ v                    no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no
range(S) ≤ v                    yes
range(S) ≥ v                    no
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  yes
support(S) ≤ ξ                  no
Monotonicity
• Monotonicity
 – If an itemset S satisfies the constraint, so does every superset of S
 – sum(S.Price) ≥ v is monotone
 – min(S.Price) ≤ v is monotone
• Example
 – C: range(S.profit) ≥ 15
 – Itemset ab satisfies C
 – So does every superset of ab
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Monotonic Constraints

Constraint                      Monotone
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           no
min(S) ≤ v                      yes
min(S) ≥ v                      no
max(S) ≤ v                      no
max(S) ≥ v                      yes
count(S) ≤ v                    no
count(S) ≥ v                    yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      yes
range(S) ≤ v                    no
range(S) ≥ v                    yes
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  no
support(S) ≤ ξ                  yes
Converting “Tough” Constraints
• Convert tough constraints into anti-monotone or monotone by properly ordering items
• Examine C: avg(S.profit) ≥ 25
 – Order items in value-descending order
   • <a, f, g, d, b, h, c, e>
 – If an itemset afb violates C
   • So does afbh, and afb* (any itemset with afb as a prefix)
   • C becomes anti-monotone!
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Convertible Constraints
• Let R be an order of items
• Convertible anti-monotone
 – If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
 – Ex.: avg(S) ≥ v w.r.t. item-value-descending order
• Convertible monotone
 – If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
 – Ex.: avg(S) ≥ v w.r.t. item-value-ascending order
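The convertibility argument can be checked numerically. This self-checking sketch uses the slide's profit table and C: avg(S.profit) ≥ 25 with the value-descending order R; since a prefix can only be extended by smaller-valued items, the average can never rise back above the threshold:

```python
# Sanity check of convertible anti-monotonicity on the slide's profit table.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}
R = sorted(profit, key=profit.get, reverse=True)   # <a, f, g, d, b, h, c, e>

def avg(items):
    return sum(profit[i] for i in items) / len(items)

# w.r.t. R, extending a prefix appends only smaller values, so the
# average can only drop: if a prefix violates avg(S) >= 25, every
# extension of that prefix violates it too.
prefix = ["a", "f", "b"]          # avg = 70/3 < 25: violates C
assert avg(prefix) < 25
for nxt in R[R.index("b") + 1:]:  # items after b in R: h, c, e
    assert avg(prefix + [nxt]) < 25
print("every extension of afb along R still violates C")
```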
Strongly Convertible Constraints
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e> – Itemset af violates a constraint C, so does
every itemset with af as prefix, such as afd • avg(X) ≥ 25 is convertible monotone
w.r.t. item value ascending order R⁻¹: <e, c, h, b, d, g, f, a>
 – Itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Convertible Constraints

Constraint                                Convertible      Convertible  Strongly
                                          anti-monotone    monotone     convertible
avg(S) ≤ v, ≥ v                           Yes              Yes          Yes
median(S) ≤ v, ≥ v                        Yes              Yes          Yes
sum(S) ≤ v (items of any value, v ≥ 0)    Yes              No           No
sum(S) ≤ v (items of any value, v ≤ 0)    No               Yes          No
sum(S) ≥ v (items of any value, v ≥ 0)    No               Yes          No
sum(S) ≥ v (items of any value, v ≤ 0)    Yes              No           No
Can Apriori Handle Convertible Constraints?
• A convertible constraint that is neither monotone, nor anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
 – Within the level-wise framework, no direct pruning based on the constraint can be made
 – Itemset df violates constraint C: avg(X) ≥ 25
 – Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!
Item  Value
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Mining With Convertible Constraints
• C: avg(S.profit) ≥ 25
• List items in every transaction in value-descending order R:
 – <a, f, g, d, b, h, c, e>
 – C is convertible anti-monotone w.r.t. R
• Scan the transaction DB once
 – Remove infrequent items
   • Item h in transaction 40 is dropped
 – Itemsets a and f are good
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e

TDB (min_sup=2)

Item  Profit
a     40
f     30
g     20
d     10
b     0
h     -10
c     -20
e     -30
Not Every Pattern Is Interesting!
• Trivial patterns
 – Pregnant → Female [100% confidence]
• Misleading patterns
 – Play basketball → eat cereal [40%, 66.7%]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4) 118
            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
Evaluation Criteria
• Objective interestingness measures
 – Examples: support, patterns formed by mutually independent items
 – Domain independent
• Subjective measures
 – Examples: domain knowledge, templates/constraints
Correlation and Lift
• P(B|A)/P(B) is called the lift of rule A → B
• Play basketball → eat cereal (lift: 0.89)
• Play basketball → not eat cereal (lift: 1.33)

corr(A, B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
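The two lift values on this slide can be reproduced directly from the contingency table (a small sketch; the variable names are ad hoc):

```python
# Reproducing the two lift values from the basketball/cereal table
n = 5000
basketball, cereal, both = 3000, 3750, 2000

p_b = basketball / n                 # P(basketball)
p_c = cereal / n                     # P(cereal)

lift_cereal = (both / n) / (p_b * p_c)                      # A -> B
lift_not = ((basketball - both) / n) / (p_b * (1 - p_c))    # A -> not B

print(round(lift_cereal, 2), round(lift_not, 2))  # 0.89 1.33
```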
Contingency table

Table 6.7. A 2-way contingency table for variables A and B.

      B    ¬B
A     f11  f10  f1+
¬A    f01  f00  f0+
      f+1  f+0  N

… counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ¬A (¬B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Limitations of the Support-Confidence Framework  The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.

Table 6.8. Beverage preferences among a group of 1000 people.

      Coffee  ¬Coffee
Tea   150     50       200
¬Tea  650     150      800
      800     200      1000
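The interest factor used below reduces to N·f11 / (f1+·f+1), so the tea-coffee figure can be reproduced in one line (sketch; the helper name is ad hoc):

```python
# Interest factor from a 2x2 contingency table:
# I(A, B) = (f11/N) / ((f1+/N) * (f+1/N)) = N * f11 / (f1+ * f+1)
def interest(f11, f1_plus, f_plus1, n):
    return n * f11 / (f1_plus * f_plus1)

# Tea-coffee numbers from Table 6.8: f11=150, f1+=200, f+1=800, N=1000
print(interest(150, 200, 800, 1000))  # 0.9375
```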
Property of Lift
• If A and B are independent, lift = 1 • If A and B are positively correlated, lift > 1 • If A and B are negatively correlated, lift < 1 • Limitation: lift is sensitive to P(A) and P(B)
Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

      p    ¬p
q     880  50   930
¬q    50   20   70
      930  70   1000

      r    ¬r
s     20   50   70
¬s    50   880  930
      70   930  1000

This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:

I(A, B) is = 1 if A and B are independent; > 1 if A and B are positively correlated; < 1 if A and B are negatively correlated.   (6.7)

For the tea-coffee example shown in Table 6.8, I = 0.15 / (0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.

Limitations of Interest Factor  We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.

Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than for {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).
lift(p, q) < lift(r, s)!
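The troubling word-pair numbers can be verified from Table 6.9 (sketch; the helper names are ad hoc):

```python
# Lift vs. confidence on the word-pair tables (Table 6.9, N = 1000)
def lift(f11, f1_plus, f_plus1, n):
    return (f11 / n) / ((f1_plus / n) * (f_plus1 / n))

def confidence(f11, f1_plus):
    return f11 / f1_plus

print(round(lift(880, 930, 930, 1000), 2))   # {p,q}: 1.02
print(round(lift(20, 70, 70, 1000), 2))      # {r,s}: 4.08
print(round(confidence(880, 930), 3))        # p -> q: 0.946
print(round(confidence(20, 70), 3))          # r -> s: 0.286
```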
From Itemsets to Sequences
• Itemsets: combinations of items, no temporal order • Temporal order is important in many situations
 – Time-series databases and sequence databases
 – Frequent patterns → (frequent) sequential patterns
• Applications of sequential pattern mining
 – Customer shopping sequences:
   • First buy a computer, then an iPod, and then a digital camera, within 3 months
– Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences
A sequence database; a sequence: <(ef)(ab)(df)cb>

An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
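The subsequence relation used here (each element of the candidate must be contained, in order, in some element of the sequence) can be written as a short greedy check; a sketch:

```python
# Greedy subsequence test where each element is a set of items:
# sub is a subsequence of seq if sub's elements can be matched,
# in order, to (subsets of) elements of seq.
def is_subseq(sub, seq):
    i = 0
    for elem in seq:
        if i < len(sub) and sub[i] <= elem:
            i += 1
    return i == len(sub)

seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                         # <a(bc)dc>
print(is_subseq(sub, seq))  # True
```

Matching each element as early as possible is safe here: taking an earlier match never rules out a later one.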
Challenges in Seq Pat Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should
 – Find the complete set of patterns satisfying the minimum support (frequency) threshold
 – Be highly efficient and scalable, involving only a small number of database scans
 – Be able to incorporate various kinds of user-specific constraints
Apriori Property of Seq Patterns
• Apriori property in sequential patterns
 – If a sequence S is infrequent, then none of the super-sequences of S is frequent
 – E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Given support threshold min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
GSP
• GSP (Generalized Sequential Pattern) mining
• Outline of the method
 – Initially, every item in the DB is a length-1 candidate
 – For each level (i.e., sequences of length k):
   • Scan the database to collect the support count for each candidate sequence
   • Generate length-(k+1) candidate sequences from length-k frequent sequences using the Apriori property
 – Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by the Apriori property
Finding Len-1 Seq Patterns
• Initial candidates
 – <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan the database once
 – Count support for each candidate
min_sup =2
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Generating Length-2 Candidates

     <a>   <b>   <c>   <d>   <e>   <f>
<a>  <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>  <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>  <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>  <da>  <db>  <dc>  <dd>  <de>  <df>
<e>  <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>  <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

     <a>  <b>     <c>     <d>     <e>     <f>
<a>       <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>               <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                       <(cd)>  <(ce)>  <(cf)>
<d>                               <(de)>  <(df)>
<e>                                       <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of the candidates
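The candidate counts above follow from simple counting: k frequent items give k² ordered candidates <xy> plus k(k-1)/2 unordered element candidates <(xy)>. A quick check:

```python
# Length-2 candidate counts: k items give k*k ordered candidates <xy>
# plus k*(k-1)//2 element candidates <(xy)>
def len2_candidates(k):
    return k * k + k * (k - 1) // 2

with_apriori = len2_candidates(6)     # only the 6 frequent items
without = len2_candidates(8)          # all 8 items
pruned = (without - with_apriori) / without * 100
print(with_apriori, without, round(pruned, 2))  # 51 92 44.57
```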
Finding Len-2 Seq Patterns
• Scan the database one more time to collect the support count for each length-2 candidate
• 19 length-2 candidates pass the minimum support threshold
 – They are the length-2 sequential patterns
Generating Length-3 Candidates and Finding Length-3 Patterns
• Generate length-3 candidates
 – Self-join length-2 sequential patterns
   • <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
   • <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
 – 46 candidates are generated
• Find Length-3 Sequential Patterns – Scan database once more, collect support
counts for candidates – 19 out of 46 candidates pass support threshold
The GSP Mining Process

min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 seq. patterns
2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 seq. patterns, 10 candidates not in the DB at all
3rd scan: 46 candidates (<abb> <aab> <aba> <baa> <bab> …), 19 length-3 seq. patterns, 20 candidates not in the DB at all
4th scan: 8 candidates (<abba> <(bd)bc> …), 6 length-4 seq. patterns
5th scan: 1 candidate (<(bd)cba>), 1 length-5 seq. pattern

(The remaining candidates either cannot pass the support threshold or do not appear in the DB at all.)
The GSP Algorithm
• Take sequences of the form <x> as length-1 candidates
• Scan the database once to find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do
 – Form Ck+1, the set of length-(k+1) candidates, from Fk
 – If Ck+1 is not empty, scan the database once to find Fk+1, the set of length-(k+1) sequential patterns
 – Let k = k + 1
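The loop above can be sketched in code. This is a deliberately simplified GSP-style level-wise miner, not the full algorithm: it assumes one item per element (no parenthesized element sets), a tiny made-up database, and a naive join (extend each frequent sequence by every frequent item) instead of GSP's Fk-join-Fk candidate generation:

```python
# Simplified GSP-style level-wise sequential pattern miner (sketch).
db = [list("abcd"), list("abd"), list("acd"), list("bcd")]
min_sup = 2

def sup(cand):
    # cand is a subsequence of s if its items appear in s in order
    def occurs(s):
        it = iter(s)
        return all(x in it for x in cand)
    return sum(occurs(s) for s in db)

items = sorted({x for s in db for x in s})
F = [[i] for i in items if sup([i]) >= min_sup]
patterns = list(F)
while F:
    C = [f + [i] for f in F for i in items]     # length-(k+1) candidates
    F = [c for c in C if sup(c) >= min_sup]     # one DB "scan" per level
    patterns += F

result = ["".join(p) for p in patterns]
print(result)
```

On this toy database the miner finds 13 patterns, e.g. "abd" is frequent (support 2) while "abc" occurs in only one sequence and is pruned.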
Bottlenecks of GSP
• A huge set of candidates
 – 1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
• Multiple scans of the database in mining
• Real challenge: mining long sequential patterns
 – An exponential number of short candidates
 – A length-100 sequential pattern needs Σ_{i=1..100} C(100, i) = 2^100 - 1 ≈ 10^30 candidate sequences!
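These counts can be verified directly:

```python
# Candidate blow-up numbers from this slide
n = 1000
len2 = n * n + n * (n - 1) // 2       # length-2 candidates from 1,000 items
short_cands = 2 ** 100 - 1            # nonempty subsequences of a length-100 pattern
print(len2)                           # 1499500
print(short_cands > 10 ** 30)         # True: about 1.27e30
```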
FreeSpan: Frequent-Pattern-projected Sequential Pattern Mining
• The itemset of a sequential pattern must be frequent
 – Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
 – Mine each projected database to find its patterns

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All sequential patterns can be divided into 6 subsets:
• Those containing item f
• Those containing e but no f
• Those containing d but no e nor f
• Those containing a but no d, e or f
• Those containing c but no a, d, e or f
• Those containing only item b

Sequence Database SDB
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
From FreeSpan to PrefixSpan
• FreeSpan:
 – Projection-based: no candidate sequence needs to be generated
 – But projection can be performed at any point in the sequence, and the projected sequences may not shrink much
• PrefixSpan:
 – Also projection-based
 – But only prefix-based projection: fewer projections and quickly shrinking sequences
Prefix and Suffix (Projection)
• Given sequence <a(abc)(ac)d(cf)>
• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
Mining Sequential Patterns by Prefix Projections
• Step 1: find length-1 sequential patterns
 – <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
 – The ones having prefix <a>
 – The ones having prefix <b>
 – …
 – The ones having prefix <f>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Finding Seq. Pat. with Prefix <a>
• Only need to consider projections w.r.t. <a>
 – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
 – Further partition into 6 subsets
   • Having prefix <aa>
   • …
   • Having prefix <af>
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Completeness of PrefixSpan
SDB
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a> → <a>-projected database:
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

Having prefix <aa> → <aa>-projected database; …; having prefix <af> → <af>-projected database

Having prefix <b> → <b>-projected database; …; similarly for prefixes <c>, …, <f>
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking • Major cost of PrefixSpan: constructing
projected databases – Can be improved by bi-level projections
Effectiveness
• Redundancy due to anti-monotonicity
 – <abcd> alone leads to 15 sequential patterns of the same support
 – Remedies: closed sequential patterns and sequential generators
• Constraints on sequential patterns
 – Gap
 – Length
 – More sophisticated, application-oriented constraints
Data Warehousing & OLAP
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1) 143
Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
Product information (Product-id, category, manufacturer, made-in, stock-price, …)
Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
Business queries:
• Which categories of products are most popular for customers in Vancouver?
• Find pairs (customer group, most popular products)
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …
In what aspects is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?
Don’t You Ever Google Yourself?
• Big data helps one know oneself better
• 57% of American adults search for themselves on the Internet
  – Good news: those people are better paid than those who haven’t done so! (Investors.com)
• Egocentric analysis becomes more and more important with big data
Egocentric Analysis
• How am I different from (more often than not, better than) others?
• In what aspects am I good?
http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg
Dimensions
• “An aspect or feature of a situation, problem, or thing; a measurable extent of some kind” – Dictionary
• Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
  – Objects are compared in selected dimensions/attributes
• More often than not, objects have more dimensions/attributes than one is interested in or can handle
Multi-dimensional Analysis
• Find interesting patterns in multi-dimensional subspaces
  – “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)”
• Different patterns may be manifested in different subspaces
  – Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction – one set of features for all objects
  – Different subspaces may manifest different patterns
OLAP
• Conceptually, we may explore all possible subspaces for interesting patterns
• What patterns are interesting?
• How can we explore all possible subspaces systematically and efficiently?
• These are fundamental problems in analytics and data mining
OLAP
• Aggregates and group-bys are frequently used in data analysis and summarization
  SELECT time, altitude, AVG(temp)
  FROM weather
  GROUP BY time, altitude;
  – In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys 20 times
• Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently
OLAP Operations
• Roll up (drill up): summarize data by climbing up a hierarchy or by dimension reduction
  – (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
• Drill down (roll down): the reverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
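As a sketch (our own toy example, not tied to any particular OLAP engine), rolling up (Day, Store) to (Month, City) just re-aggregates the measure under the coarser keys given the concept hierarchies:

```python
from collections import defaultdict

def roll_up(facts, day_to_month, store_to_city):
    """Re-aggregate (day, store, sales) facts at the (Month, City) level."""
    out = defaultdict(int)
    for day, store, sales in facts:
        # climb each dimension's hierarchy, then sum under the coarser key
        out[(day_to_month[day], store_to_city[store])] += sales
    return dict(out)

facts = [('Jan-03', 'S1', 10), ('Jan-17', 'S2', 5), ('Feb-02', 'S1', 7)]
day_to_month = {'Jan-03': 'Jan', 'Jan-17': 'Jan', 'Feb-02': 'Feb'}
store_to_city = {'S1': 'Vancouver', 'S2': 'Vancouver'}
roll_up(facts, day_to_month, store_to_city)
# -> {('Jan', 'Vancouver'): 15, ('Feb', 'Vancouver'): 7}
```

Drill-down is the inverse direction and cannot be computed from the rolled-up result alone; it requires the finer-grained data.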
Roll Up
http://www.tutorialspoint.com/dwh/images/rollup.jpg
Drill Down
http://www.tutorialspoint.com/dwh/images/drill_down.jpg
Other Operations
• Dice: pick specific values or ranges on some dimensions
• Pivot: “rotate” a cube – changing the order of dimensions in visual analysis
http://en.wikipedia.org/wiki/File:OLAP_pivoting.png
Dice
http://www.tutorialspoint.com/dwh/images/dice.jpg
Relational Representation
• If there are n dimensions, there are 2^n possible aggregation columns
• Roll up by model, by year, and by color in a table
Difficulties
• Many group-bys are needed
  – 6 dimensions → 2^6 = 64 group-bys
• In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!
Dummy Value “ALL”
CUBE
SALES
Model | Year | Color | Sales
Chevy | 1990 | red   |  5
Chevy | 1990 | white | 87
Chevy | 1990 | blue  | 62
Chevy | 1991 | red   | 54
Chevy | 1991 | white | 95
Chevy | 1991 | blue  | 49
Chevy | 1992 | red   | 31
Chevy | 1992 | white | 54
Chevy | 1992 | blue  | 71
Ford  | 1990 | red   | 64
Ford  | 1990 | white | 62
Ford  | 1990 | blue  | 63
Ford  | 1991 | red   | 52
Ford  | 1991 | white |  9
Ford  | 1991 | blue  | 55
Ford  | 1992 | red   | 27
Ford  | 1992 | white | 62
Ford  | 1992 | blue  | 39
CUBE ↓

DATA CUBE
Model | Year | Color | Sales
Chevy | 1990 | blue  |  62
Chevy | 1990 | red   |   5
Chevy | 1990 | white |  87
Chevy | 1990 | ALL   | 154
Chevy | 1991 | blue  |  49
Chevy | 1991 | red   |  54
Chevy | 1991 | white |  95
Chevy | 1991 | ALL   | 198
Chevy | 1992 | blue  |  71
Chevy | 1992 | red   |  31
Chevy | 1992 | white |  54
Chevy | 1992 | ALL   | 156
Chevy | ALL  | blue  | 182
Chevy | ALL  | red   |  90
Chevy | ALL  | white | 236
Chevy | ALL  | ALL   | 508
Ford  | 1990 | blue  |  63
Ford  | 1990 | red   |  64
Ford  | 1990 | white |  62
Ford  | 1990 | ALL   | 189
Ford  | 1991 | blue  |  55
Ford  | 1991 | red   |  52
Ford  | 1991 | white |   9
Ford  | 1991 | ALL   | 116
Ford  | 1992 | blue  |  39
Ford  | 1992 | red   |  27
Ford  | 1992 | white |  62
Ford  | 1992 | ALL   | 128
Ford  | ALL  | blue  | 157
Ford  | ALL  | red   | 143
Ford  | ALL  | white | 133
Ford  | ALL  | ALL   | 433
ALL   | 1990 | blue  | 125
ALL   | 1990 | red   |  69
ALL   | 1990 | white | 149
ALL   | 1990 | ALL   | 343
ALL   | 1991 | blue  | 104
ALL   | 1991 | red   | 106
ALL   | 1991 | white | 104
ALL   | 1991 | ALL   | 314
ALL   | 1992 | blue  | 110
ALL   | 1992 | red   |  58
ALL   | 1992 | white | 116
ALL   | 1992 | ALL   | 284
ALL   | ALL  | blue  | 339
ALL   | ALL  | red   | 233
ALL   | ALL  | white | 369
ALL   | ALL  | ALL   | 941
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
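What GROUP BY CUBE computes can be sketched by brute force (our own illustration, not how a DBMS implements it): every tuple contributes to all 2^n combinations of (value, ALL) over the n dimensions.

```python
from itertools import combinations
from collections import defaultdict

def cube_sum(rows, dims, measure):
    """SUM(measure) for every group-by in the cube of dims ('ALL' = aggregated out)."""
    out = defaultdict(int)
    for r in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                # replace every dimension not in this group-by with the dummy ALL
                key = tuple(r[d] if d in kept else 'ALL' for d in dims)
                out[key] += r[measure]
    return dict(out)

rows = [
    {'Model': 'Chevy', 'Year': 1990, 'Color': 'red',   'Sales': 5},
    {'Model': 'Chevy', 'Year': 1990, 'Color': 'white', 'Sales': 87},
    {'Model': 'Ford',  'Year': 1990, 'Color': 'red',   'Sales': 64},
]
c = cube_sum(rows, ('Model', 'Year', 'Color'), 'Sales')
c[('ALL', 1990, 'red')]   # -> 69
c[('ALL', 'ALL', 'ALL')]  # -> 156
```

The cost of this naive approach grows as 2^n per tuple, which is exactly why efficient cube computation (discussed later) matters.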
Semantics of ALL
• ALL is a set
  – Model.ALL = ALL(Model) = {Chevy, Ford}
  – Year.ALL = ALL(Year) = {1990, 1991, 1992}
  – Color.ALL = ALL(Color) = {red, white, blue}
OLTP Versus OLAP

                   | OLTP                                                     | OLAP
users              | clerk, IT professional                                   | knowledge worker
function           | day-to-day operations                                    | decision support
DB design          | application-oriented                                     | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                               | ad hoc
access             | read/write, index/hash on primary key                    | lots of scans
unit of work       | short, simple transaction                                | complex query
# records accessed | tens                                                     | millions
# users            | thousands                                                | hundreds
DB size            | 100 MB–GB                                                | 100 GB–TB
metric             | transaction throughput                                   | query throughput, response time
What Is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
• Data warehousing: the process of constructing and using data warehouses
Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Integrated
• Integrating multiple, heterogeneous data sources
  – Relational databases, flat files, online transaction records
• Data cleaning and data integration
  – Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  – When data is moved to the warehouse, it is converted
Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational databases: current-value data
  – Data warehouse data: provide information from a historical perspective (e.g., the past 5–10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”
Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational updates of data do not occur in the data warehouse environment
  – No need for transaction processing, recovery, or concurrency control mechanisms
  – Only two operations in data accessing:
    • Initial loading of data
    • Access of data
Why Separate Data Warehouse?
• High performance for both
  – Operational DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data
  – Historical data: data analysis often uses historical data that operational databases do not typically maintain
  – Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data Warehouse Schema Design
• Query answering efficiency
  – Subject orientation
  – Integration
• Tradeoff between time and space
  – Universal table versus fully normalized schema
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2) 168
Star Schema
• Fact table: Sales (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension tables:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, state_or_province, country)
Snowflake Schema
• Fact table: Sales (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension tables are further normalized:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Fact Constellation
• Sales fact table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Shipping fact table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
• Shared dimension tables:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, province_or_state, country)
  – shipper (shipper_key, shipper_name, location_key, shipper_type)
(Good) Aggregate Functions
• Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i = 1, …, I_j}) | j = 1, …, n})
  – Examples: COUNT(), MIN(), MAX(), SUM()
  – G = SUM() for COUNT()
• Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, …, I_j}) | j = 1, …, n})
  – Examples: AVG(), standard deviation, MaxN(), MinN()
  – For AVG(), G() records (sum, count); H() adds the component sums and counts, then divides to produce the global average
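The algebraic property of AVG() can be seen in a few lines (a sketch; the function names are ours): the sub-aggregate G keeps a fixed-size (sum, count) pair per partition, and H merges the pairs.

```python
def g_avg(partition):
    # G(): constant-size sub-aggregate for AVG, a (sum, count) pair
    return (sum(partition), len(partition))

def h_avg(sub_aggregates):
    # H(): merge the per-partition pairs into the global average
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count

h_avg([g_avg([1, 2, 3]), g_avg([4, 5])])  # -> 3.0, same as AVG over all 5 values
```

Averaging the per-partition averages directly would be wrong whenever partitions have different sizes, which is exactly why the (sum, count) pair is needed.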
Holistic Aggregate Functions
• There is no constant bound on the size of the storage needed to describe a sub-aggregate
  – There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i = 1, …, I})
• Examples: Median(), MostFrequent() (also called Mode()), and Rank()
Index Requirements in OLAP
• Data is read-only
  – (Almost) no insertions or deletions
• Query types
  – Point query: looking up one specific tuple (rare)
  – Range query: returning the aggregate of a (large) set of tuples, with group-by
  – Complex queries: need specific algorithms and index structures, discussed later
OLAP Query Example
• In a table (cust, gender, …), find the total number of male customers
• Method 1: scan the table once
• Method 2: build a B+ tree index on attribute gender; we still need to access all tuples of male customers
• Can we get the count without scanning many tuples, not even all the tuples of male customers?
Bitmap Index
• For n tuples, a bitmap index has n bits, which can be packed into ⎡n/8⎤ bytes or ⎡n/32⎤ words
• From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

Example (bitmap for gender = M):
cust  | gender | bit
Jack  | M      | 1
Cathy | F      | 0
…     | …      | …
Nancy | F      | 0
Using Bitmap to Count
• shcount[] contains the number of 1-bits in its entry subscript
  – Example: shcount[01100101] = 4

  count = 0;
  for (i = 0; i < SHNUM; i++)
      count += shcount[B[i]];
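In Python, the same byte-table trick looks like this (a sketch; SHCOUNT mirrors the shcount[] array above, and the 10-tuple bitmap is our own toy data):

```python
# SHCOUNT[b] = number of 1-bits in byte value b, precomputed once
SHCOUNT = [bin(b).count('1') for b in range(256)]

def count_ones(bitmap):
    """Count set bits in a packed bitmap (a bytes/bytearray object)."""
    return sum(SHCOUNT[b] for b in bitmap)

# pack a 10-tuple bitmap (e.g., gender = M): j-th bit of p-th byte = row p*8 + j
bits = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
packed = bytearray()
for p in range(0, len(bits), 8):
    byte = 0
    for j, bit in enumerate(bits[p:p + 8]):
        byte |= bit << j
    packed.append(byte)

count_ones(packed)  # -> 6 male customers
```

The count is obtained from the packed bytes alone; no customer tuple is ever touched.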
Advantages of Bitmap Index
• Efficient in space
• Ready for logical composition
  – C = C1 AND C2
  – Bitmap operations can be used
• Bitmap indexes only work for categorical data with low cardinality
  – Naively, we need 50 bits per entry to represent the state of a customer in the US
  – How do we represent a sale amount in dollars?
Bit-Sliced Index
• A sale amount can be written as an integer number of pennies and then represented as a binary number of N bits
  – 24 bits cover up to $167,772.15, appropriate for many stores
• A bit-sliced index is N bitmaps
  – Tuple j sets its bit in bitmap k if the k-th bit of its binary representation is on
  – The space cost of a bit-sliced index is the same as storing the data directly
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
– The tuples satisfying C are identified by a bitmap B
• Direct access to rows to calculate SUM: scan the whole table once
• B+ tree: find the tuples from the tree
• Projection index: scan only attribute sales
• Bit-sliced index: get the sum from Σ_k COUNT(B AND B_k) * 2^k
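The bit-sliced evaluation can be sketched as follows (our own illustration; the amounts are made-up pennies): bitmap B_k has bit j set when bit k of tuple j's amount is 1, and the selected sum is recovered from popcounts alone.

```python
def build_slices(amounts, nbits=24):
    """Bit-sliced index: slices[k] has bit j set iff bit k of amounts[j] is 1."""
    slices = [0] * nbits
    for j, a in enumerate(amounts):
        for k in range(nbits):
            if (a >> k) & 1:
                slices[k] |= 1 << j
    return slices

def sliced_sum(slices, b):
    """SUM of amounts over tuples selected by bitmap b: sum_k COUNT(b AND B_k) * 2^k."""
    return sum(bin(b & bk).count('1') << k for k, bk in enumerate(slices))

amounts = [599, 1250, 75, 3100]   # sale amounts in pennies
slices = build_slices(amounts)
b = 0b1011                        # predicate C selects tuples 0, 1, and 3
sliced_sum(slices, b)  # -> 4949 (= 599 + 1250 + 3100)
```

Only N bitmap ANDs and popcounts are performed, regardless of how many tuples the predicate selects.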
Cost Comparison
• Traditional value-list index (B+ tree) is costly in both I/O and CPU time
  – Not good for OLAP
• Bit-sliced index is efficient in I/O
• Other case studies in [O’Neil and Quass, SIGMOD’97]
Horizontal or Vertical Storage
• A fact table for data warehousing is often fat
  – Tens or even hundreds of dimensions/attributes
• A query is often about only a few attributes
• Horizontal storage: tuples are stored one by one
• Vertical storage: tuples are stored by attributes
[Figure: the same table with attributes A1, A2, …, A100 and tuples (x1, …, x100), …, (z1, …, z100), stored row by row (horizontal) and column by column (vertical)]
Horizontal Versus Vertical
• Find the information of tuple t
  – Typical in OLTP
  – Horizontal storage: get the whole tuple in one search
  – Vertical storage: search 100 lists
• Find SUM(a100) GROUP BY {a22, a83}
  – Typical in OLAP
  – Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
  – Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal method
• Projection index: vertical storage
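The 3-list claim can be illustrated with a toy column store (our own example, with made-up attribute values): the group-by touches only the three columns it names.

```python
from collections import defaultdict

# six tuples, stored horizontally (rows) and vertically (cols)
rows = [{'a22': r % 2, 'a83': r % 3, 'a100': r} for r in range(6)]
cols = {name: [row[name] for row in rows] for name in ('a22', 'a83', 'a100')}

# SUM(a100) GROUP BY a22, a83 reads just three lists in the vertical layout
out = defaultdict(int)
for g1, g2, v in zip(cols['a22'], cols['a83'], cols['a100']):
    out[(g1, g2)] += v

out[(1, 0)]  # -> 3 (only the tuple r=3 has a22=1 and a83=0)
```

A row store answering the same query must read every attribute of every tuple, even though 97 of the 100 attributes are irrelevant to it.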
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 184
MOLAP
[Figure: a 3-D array with dimensions Product (TV, VCR, PC), Date (1Qtr–4Qtr), and Country (U.S.A, Canada, Mexico), with sum cells along each dimension]
Pros and Cons
• Easy to implement • Fast retrieval • Many entries may be empty if data is sparse • Costly in space
ROLAP – Data Cube in Table
• A multi-dimensional database

Base table:
Store | Product | Season | Sales
S1    | P1      | Spring | 6
S1    | P2      | Spring | 12
S2    | P1      | Fall   | 9

Cubing produces:
Store | Product | Season | AVG(Sales)
S1    | P1      | Spring | 6
S1    | P2      | Spring | 12
S2    | P1      | Fall   | 9
S1    | *       | Spring | 9
…     | …       | …      | …
*     | *       | *      | 9
Data Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time, item, location, supplier
• 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
• 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
Data Cube: A Lattice of Cuboids
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  – E.g., (9/15, milk, Urbana, Dairy_land), (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Chicago, *), (*, milk, *, *)
Full Cube vs. Iceberg Cube
• Full cube vs. iceberg cube:

    compute cube sales_iceberg as
    select month, city, customer_group, count(*)
    from salesInfo
    cube by month, city, customer_group
    having count(*) >= min_support    -- the iceberg condition

• Avoid explosive growth: consider a cube with 100 dimensions and only 2 base cells, (a1, a2, …, a100) and (b1, b2, …, b100)
  – How many aggregate cells if "having count >= 1"?
  – What about "having count >= 2"?
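The explosive-growth question can be checked by brute force for small dimensionality, and the counts scale the same way as d grows. A sketch (names are illustrative) assuming the two base cells differ in every dimension:

```python
from itertools import product

def aggregate_cells(base_cells, min_count):
    """Brute-force the iceberg question for small d: count aggregate cells
    (at least one '*') whose COUNT over the base cells >= min_count."""
    d = len(base_cells[0])
    # Per dimension, a candidate cell takes a value seen in the data or '*'.
    domains = [sorted({cell[i] for cell in base_cells} | {"*"}) for i in range(d)]
    n = 0
    for cand in product(*domains):
        if "*" not in cand:
            continue  # base cells themselves are not aggregate cells
        covered = sum(
            all(c == "*" or c == v for c, v in zip(cand, cell))
            for cell in base_cells
        )
        if covered >= min_count:
            n += 1
    return n

# Two fully distinct base cells in d = 3 dimensions: 2*(2^3 - 1) - 1 = 13
# aggregate cells pass "count >= 1", but only the apex (*, *, *) passes
# "count >= 2" because the base cells disagree in every dimension.
base = [("a1", "a2", "a3"), ("b1", "b2", "b3")]
print(aggregate_cells(base, 1))  # 13
print(aggregate_cells(base, 2))  # 1
```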
Multi-Way Array Aggregation
• Array-based "bottom-up" algorithm
  – Uses multi-dimensional chunks
  – No direct tuple comparisons
  – Simultaneous aggregation on multiple dimensions
  – Intermediate aggregate values are re-used for computing ancestor cuboids
  – Cannot do Apriori pruning: no iceberg optimization
[Figure: the cuboid lattice All → A, B, C → AB, AC, BC → ABC.]
Multi-way Array Aggregation for Cube Computation (MOLAP)
• Partition arrays into chunks (a small subcube that fits in memory)
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in "multi-way" by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost
What is the best traversing order to do multi-way aggregation?
[Figure: a 3-D array on dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64.]
Multi-way Array Aggregation for Cube Computation (3-D to 2-D)
• The best order is the one that minimizes the memory requirement and reduces I/O

[Figure: aggregating the 3-D cuboid ABC down to the 2-D cuboids AB, AC, and BC in the lattice All → A, B, C → AB, AC, BC → ABC.]
Multi-way Array Aggregation for Cube Computation (2-D to 1-D)
[Figure: aggregating the 2-D cuboids AB, AC, and BC down to the 1-D cuboids A, B, and C in the same lattice.]
Multi-Way Array Aggregation for Cube Computation

• Method: the planes should be sorted and computed according to their size in ascending order
  – Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
• Limitation: the method works well only for a small number of dimensions
  – With many dimensions, "top-down" computation and iceberg cube computation methods can be explored
Iceberg Cube
• In a data cube, many aggregate cells are trivial
  – Their aggregate value is too small to be interesting
• Iceberg query
Monotonic Iceberg Condition
• If COUNT(a, b, *)<100, then COUNT(a, b, c)<100 for any c
• For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
  – (a, b, *) is an ancestor of (a, b, c)
• An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can satisfy P
BUC
• Once a base table (A, B, C) is sorted by A-B-C, aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can be computed with one scan and 4 counters
• To compute other aggregates, we can sort the base table in some other orders
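The one-scan idea can be sketched as follows: while scanning the sorted table, one counter per prefix length is maintained (plus one for the grand total) and flushed whenever its prefix changes. This is a hypothetical helper illustrating the counting trick, not the full BUC algorithm:

```python
def prefix_counts(rows):
    """One scan over rows sorted by (A, B, C): emit COUNT for every
    prefix aggregate (*,*,*), (A,*,*), (A,B,*) and (A,B,C)."""
    rows = sorted(rows)  # sort by A-B-C
    out = {}
    total = 0
    counters = [0, 0, 0]       # counts for the current A-, AB-, ABC-prefix
    prev = None
    for row in rows + [None]:  # a None sentinel flushes the last prefixes
        if prev is not None:
            for depth in (3, 2, 1):  # flush each counter whose prefix ended
                if row is None or row[:depth] != prev[:depth]:
                    key = prev[:depth] + ("*",) * (3 - depth)
                    out[key] = counters[depth - 1]
                    counters[depth - 1] = 0
        if row is not None:
            total += 1
            for depth in range(3):
                counters[depth] += 1
        prev = row
    out[("*", "*", "*")] = total
    return out

rows = [("a", "x", 1), ("a", "x", 1), ("a", "y", 2), ("b", "x", 1)]
counts = prefix_counts(rows)
print(counts[("a", "*", "*")], counts[("a", "x", "*")], counts[("*", "*", "*")])  # 3 2 4
```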
Example
  Location    Year   Color    Amount
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
  Seattle     2015   Green    120
  Vancouver   2014   Green    260
  Seattle     2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
Iceberg threshold: sum(Amount) >= 300
Example: Sorting on Location
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2015   Yellow   300
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
  Vancouver   2014   Green    260
  Victoria    2014   Red      400
Sum(Seattle, *, *) = 280 ✗
Sum(Vancouver, *, *) = 1000 ✓
Sum(Victoria, *, *) = 400 ✓
Sorting on Year for Vancouver
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2014   Green    260
  Vancouver   2015   Yellow   300
  Vancouver   2015   Red      160
  Victoria    2014   Red      400
Sum(Vancouver, 2014, *) = 540 ✓
Sum(Vancouver, 2015, *) = 460 ✓
Color on Vancouver & 2014/2015
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Green    260
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
Sum(Vancouver, 2014, Yellow) = 280 ✗
Sum(Vancouver, 2014, Green) = 260 ✗
Sum(Vancouver, 2015, Yellow) = 300 ✓
Sum(Vancouver, 2015, Red) = 160 ✗
Sort on Color for Vancouver
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Green    260
  Vancouver   2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
Sum(Vancouver, *, Green) = 260 ✗
Sum(Vancouver, *, Red) = 160 ✗
Sum(Vancouver, *, Yellow) = 580 ✓
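The pruning step of this example can be reproduced with a simple group-and-sum pass: Seattle's total falls below the threshold, so BUC never examines any (Seattle, Year, Color) cell. A sketch with illustrative names:

```python
from collections import defaultdict

# The sales table from the example: (Location, Year, Color, Amount).
rows = [
    ("Vancouver", 2015, "Yellow", 300), ("Victoria", 2014, "Red", 400),
    ("Seattle", 2015, "Green", 120),    ("Vancouver", 2014, "Green", 260),
    ("Seattle", 2015, "Red", 160),      ("Vancouver", 2014, "Yellow", 280),
    ("Vancouver", 2015, "Red", 160),
]

def group_sum(rows, dims):
    """SUM(Amount) grouped by the dimension indexes in `dims`."""
    sums = defaultdict(int)
    for row in rows:
        sums[tuple(row[d] for d in dims)] += row[3]
    return dict(sums)

by_loc = group_sum(rows, [0])
# Seattle (280 < 300) is pruned; BUC recurses only into Vancouver and Victoria.
survivors = {k for k, v in by_loc.items() if v >= 300}
print(sorted(survivors))  # [('Vancouver',), ('Victoria',)]
```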
How to Sort the Base Table?
• General sorting in main memory: O(n log n)
• Counting in main memory: O(n), linear in the number of tuples in the base table
  – How to sort 1 million integers in the range 1 to 100?
  – Set up 100 counters, initialized to 0
  – Scan the integers once, counting the occurrences of each value in 1 to 100
  – Scan the integers again, putting each integer in its right place
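The counting idea above is counting sort; a minimal sketch:

```python
def counting_sort(values, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n + hi - lo): count occurrences
    in one scan, then write the values back out in order."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    out = []
    for offset, c in enumerate(counts):
        out.extend([lo + offset] * c)
    return out

print(counting_sort([42, 7, 100, 7, 1]))  # [1, 7, 7, 42, 100]
```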
Pushing Monotonic Conditions
• BUC searches the aggregates bottom-up in a depth-first manner
• The descendants of the current node are expanded only when the monotonic condition holds at the node
Clustering
Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Customer Relation Management
• Partition customers into groups such that customers within a group are similar in some aspects
• A manager can be assigned to each group
• Customized products and services can be developed for each group
What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: two clusters of points, Cluster 1 and Cluster 2, with a few outliers.]
Requirements of Clustering
• Scalability
• Ability to deal with various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
Data Matrix
• For memory-based clustering – Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures) – A relational table
  \[ \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \]
Dissimilarity Matrix
• For memory-based clustering – Also called object-by-object structure – Proximities of pairs of objects – d(i, j): dissimilarity between objects i and j – Nonnegative – Close to 0: similar
  \[ \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \]
How Good Is Clustering?
• Dissimilarity/similarity depends on distance function – Different applications have different functions
• Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Continuous measurements on a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units on attributes
  – Smaller unit → larger variable range → larger effect on the result
  – Remedy: standardization + background knowledge
Standardization
• Calculate the mean absolute deviation

  \[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right), \qquad m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

• Calculate the standardized measurement (z-score)

  \[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

• The mean absolute deviation is more robust than the standard deviation
  – The effect of outliers is reduced but remains detectable
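A small sketch of this standardization (the function name is illustrative):

```python
def standardize(values):
    """z-scores using the mean absolute deviation s_f, which is more
    robust to outliers than the standard deviation."""
    n = len(values)
    m = sum(values) / n                          # m_f: the mean
    s = sum(abs(x - m) for x in values) / n      # s_f: mean absolute deviation
    return [(x - m) / s for x in values]

# Mean 5.0, mean absolute deviation (3 + 1 + 1 + 3) / 4 = 2.0.
print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```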
Similarity and Dissimilarity
• Distances are the most commonly used measures
• Minkowski distance: a generalization

  \[ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0) \]

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• If q = ∞, d is the Chebyshev distance
• Weighted distance

  \[ d(i,j) = \left(w_1|x_{i1} - x_{j1}|^q + w_2|x_{i2} - x_{j2}|^q + \cdots + w_p|x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0) \]
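The three special cases can be verified with a short function (illustrative name):

```python
def minkowski(x, y, q):
    """Minkowski distance: q=1 gives Manhattan, q=2 Euclidean,
    q=float('inf') Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if q == float("inf"):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1 / q)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))             # 7.0  (Manhattan)
print(minkowski(x, y, 2))             # 5.0  (Euclidean)
print(minkowski(x, y, float("inf")))  # 4    (Chebyshev)
```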
Manhattan and Chebyshev Distance
Manhattan distance (picture from Wikipedia)
Chebyshev distance (http://brainking.com/images/rules/chess/02.gif)
In two dimensions, the Chebyshev distance is the chessboard distance
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
  – d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangle inequality
  – d(i,j) ≤ d(i,k) + d(k,j)
Binary Variables
• A contingency table for binary data:

                  Object j
                1      0      Sum
  Object i  1   q      r      q+r
            0   s      t      s+t
          Sum  q+s    r+t      p

• Symmetric variable: each state carries the same weight
  – Invariant similarity: \( d(i,j) = \frac{r + s}{q + r + s + t} \)
• Asymmetric variable: the positive value carries more weight
  – Noninvariant similarity (Jaccard): \( d(i,j) = \frac{r + s}{q + r + s} \)
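A sketch computing both dissimilarities from two 0/1 vectors (illustrative name; q, r, s, t are the contingency counts as defined in the table):

```python
def binary_dissimilarity(i, j, asymmetric=False):
    """Dissimilarity of two binary vectors via the contingency counts
    q (1/1), r (1/0), s (0/1), t (0/0). Symmetric: (r+s)/(q+r+s+t);
    asymmetric (Jaccard-style): the 0/0 matches t are dropped."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    denom = q + r + s if asymmetric else q + r + s + t
    return (r + s) / denom

i = [1, 0, 1, 0, 0, 0]
j = [1, 1, 0, 0, 0, 0]
# q=1, r=1, s=1, t=3 for these vectors.
print(binary_dissimilarity(i, j))                   # 2/6 ≈ 0.333
print(binary_dissimilarity(i, j, asymmetric=True))  # 2/3 ≈ 0.667
```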
Nominal Variables
• A generalization of the binary variable: it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  – m: # of matches, p: total # of variables
  – \( d(i,j) = \frac{p - m}{p} \)
• Method 2: use a large number of binary variables
  – Create a new binary variable for each of the M nominal states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
  – Replace x_if by its rank \( r_{if} \in \{1, \dots, M_f\} \)
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    \( z_{if} = \frac{r_{if} - 1}{M_f - 1} \)
  – Compute the dissimilarity using methods for interval-scaled variables
Ratio-scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale
  – E.g., approximately exponential scale, such as Ae^{Bt}
• Treat them like interval-scaled variables?
  – Not a good choice: the scale can be distorted!
• Apply a logarithmic transformation, y_if = log(x_if)
• Or treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
• A database may contain all six types of variables
  – Symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
• One may use a weighted formula to combine their effects:

  \[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
Clustering Methods
• K-means and partitioning methods
• Hierarchical clustering
• Density-based clustering
• Grid-based clustering
• Pattern-based clustering
• Other clustering methods
Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – (k^n − (k−1)^n − … − 1) possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: each cluster is represented by its center
  – K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster
K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster whose center it is most similar to, based on the mean value of the objects in the cluster
  – Update the cluster means, i.e., calculate the mean value of the objects in each cluster
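The loop above can be sketched in a few lines. This is an illustrative, unoptimized implementation (ties and empty clusters are handled naively):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means sketch: pick k points as initial centers, then
    alternate assignment and mean update until assignments stabilize."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_assign == assign:
            break  # no object changed cluster: converged
        assign = new_assign
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assign

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assign = kmeans(pts, 2)
# The two well-separated groups end up in different clusters.
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```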
K-Means: Example
[Figure: k-means with K = 2 on a 2-D point set (axes 0–10). Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no change.]
Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; normally k, t ≪ n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters k
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters
Variations of the K-means

• Aspects of variation
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use the mode instead of the mean (mode: the most frequent item(s))
  – For a mixture of categorical and numerical data: the k-prototype method
• EM (expectation maximization): assigns each object to a cluster with a probability (discussed later)
A Problem of K-means
• Sensitive to outliers
  – Outlier: objects with extremely large values may substantially distort the distribution of the data
• K-medoids: use the most centrally located object in a cluster instead of the mean
PAM: A K-medoids Method
• PAM: Partitioning Around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
  – (Re)assign each object to the cluster of its nearest medoid
  – Randomly select a non-medoid object o' and compute the total cost S of swapping a medoid o with o'
  – If S < 0, swap o with o' to form the new set of k medoids
Swapping Cost
• Measure whether o’ is better than o as a medoid
• Use the squared-error criterion

  \[ E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2 \]

• Compute E_{o'} − E_o
  – Negative: swapping brings benefit
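A sketch of the swapping cost, using squared Euclidean distance as d² (the names, points, and medoids are illustrative):

```python
def squared_error(points, medoids):
    """E = sum over clusters of sum of d(p, o_i)^2, assigning each
    point to its nearest medoid (squared Euclidean used as d^2)."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in medoids)
    return total

pts = [(0, 0), (0, 1), (0, 5), (9, 9)]
e_before = squared_error(pts, [(0, 5), (9, 9)])  # medoids o = (0, 5), (9, 9)
e_after = squared_error(pts, [(0, 1), (9, 9)])   # swap o = (0, 5) for o' = (0, 1)
# Swapping cost E_o' - E_o; a negative value means the swap helps.
print(e_before, e_after, e_after - e_before)  # 41.0 17.0 -24.0
```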
PAM: Example
[Figure: PAM with K = 2 on a 2-D point set (axes 0–10). Arbitrarily choose k objects as the initial medoids and assign each remaining object to its nearest medoid (total cost = 20). Randomly select a non-medoid object O_random and compute the total cost of swapping a medoid O with it (total cost = 26); if quality improves, swap O and O_random. Loop until no change.]
Pros and Cons of PAM
• PAM is more robust than k-means in the presence of noise and outliers
  – Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well to large data sets
  – O(k(n−k)²) per iteration
Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural way of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects into a hierarchy
  – No or very little supervision
  – Some heuristic guidance on the quality of the hierarchy
Hierarchical Clustering
• Group data objects into a tree of clusters
• Top-down versus bottom-up
[Figure: hierarchical clustering of objects a, b, c, d, e — agglomerative (AGNES) merges bottom-up from Step 0 to Step 4; divisive (DIANA) splits top-down from Step 4 to Step 0]
AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Step-by-step cluster merging, until all objects form a cluster
• Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
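The merging loop above can be sketched in a few lines. This is a minimal illustration, not the original AGNES code: the Manhattan distance and all names are assumptions.

```python
def single_link_agnes(points):
    """Agglomerative clustering, single-link: repeatedly merge the two
    clusters whose closest pair of points (across clusters) is nearest.
    Returns the merge history (cluster1, cluster2, merge distance)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest cross-cluster pair
                d = min(abs(p[0] - q[0]) + abs(p[1] - q[1])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The merge history is exactly the information a dendrogram displays.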
Dendrogram
• Shows how clusters are merged hierarchically
• Decompose data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cutting the dendrogram at the desired level – Each connected component forms a cluster
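Cutting a single-link dendrogram at a threshold is equivalent to taking the connected components of the "closer than threshold" graph. A small union-find sketch of that equivalence (Manhattan distance and all names are illustrative assumptions):

```python
def cut_single_link(points, threshold):
    """Clusters obtained by cutting a single-link dendrogram at
    `threshold`: connected components of the graph that links every
    pair of points within `threshold` of each other."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            d = abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1])
            if d <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```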
DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Step-by-step splitting of clusters until each cluster contains only one object
[Figure: three 10×10 scatter plots showing DIANA progressively splitting one cluster into smaller ones]
Distance Measures
• Minimum distance
• Maximum distance
• Mean distance
• Average distance
d_min(C_i, C_j) = min_{p∈C_i, q∈C_j} d(p, q)
d_max(C_i, C_j) = max_{p∈C_i, q∈C_j} d(p, q)
d_mean(C_i, C_j) = d(m_i, m_j)
d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p∈C_i} Σ_{q∈C_j} d(p, q)

m: the mean of a cluster; C: a cluster; n: the number of objects in a cluster
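The four inter-cluster distances can be computed side by side. A sketch using Euclidean distance on 2-D points (the function name and dict keys are illustrative assumptions):

```python
def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean, and average distances between clusters
    Ci and Cj, following the four definitions above."""
    d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    pairs = [d(p, q) for p in Ci for q in Cj]
    mean = lambda C: (sum(p[0] for p in C) / len(C),
                      sum(p[1] for p in C) / len(C))
    return {
        "min": min(pairs),                        # closest cross pair
        "max": max(pairs),                        # farthest cross pair
        "mean": d(mean(Ci), mean(Cj)),            # distance between means
        "avg": sum(pairs) / (len(Ci) * len(Cj)),  # average over all pairs
    }
```

Note that the mean distance (between centroids) and the average distance (over all cross pairs) generally differ.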
Challenges
• Hard to choose merge/split points
  – Can never undo a merge/split
  – Merging/splitting decisions are critical
• High complexity: O(n²)
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
  – Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature: CF = (N, LS, SS)
N: the number of data points
LS: Σ_{i=1}^N o_i (linear sum of the points)
SS: Σ_{i=1}^N o_i² (square sum of the points)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a 10×10 grid]
Clustering Feature Vector
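Computing a CF vector and exercising the additivity property from the next slide takes only a few lines; this sketch assumes 2-D points as tuples, and the function names are illustrative:

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, and
    per-dimension sum of squares."""
    N = len(points)
    LS = tuple(sum(p[d] for p in points) for d in range(2))
    SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return (N, LS, SS)

def cf_add(cf1, cf2):
    """Additivity: merging two clusters just adds their CF entries,
    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

Additivity is what lets BIRCH maintain CF entries incrementally while scanning the data once.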
CF-tree in BIRCH
• Clustering features
  – Summarize the statistics for a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived
  – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A nonleaf node in the tree has descendants or “children”
  – The nonleaf nodes store sums of the CFs of their children
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6 — the root and nonleaf nodes hold entries (CF_i, child_i); leaf nodes hold CF entries and are chained by prev/next pointers]
Parameters of a CF-tree
• Branching factor: the maximum number of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes
BIRCH Clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records
Distance-based Methods: Drawbacks
• Hard to find clusters with irregular shapes
• Hard to specify the number of clusters
• Heuristic: a cluster must be dense
How to Find Irregular Clusters?
• Divide the whole space into many small areas
  – The density of an area can be estimated
  – Areas may or may not be exclusive
  – A dense area is likely in a cluster
• Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape
Directly Density Reachable
• Parameters
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
  – N_Eps(p) = {q | dist(p, q) ≤ Eps}
• Core object p: |N_Eps(p)| ≥ MinPts
  – A core object is in a dense area
• Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: q in the Eps-neighborhood of core object p; MinPts = 3, Eps = 1 cm]
Density-Based Clustering
• Density-reachable
  – A chain of directly density-reachable points: p1 → p2, p2 → p3, …, p(n−1) → pn
  – pn is density-reachable from p1
• Density-connected
  – If points p and q are both density-reachable from o, then p and q are density-connected
[Figure: p density-reachable from q via intermediate point p1; p and q density-connected through o]
DBSCAN
• A cluster: a maximal set of density-connected points
  – Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]
DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p wrt Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
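The algorithm above can be sketched compactly. This is a minimal illustration (Euclidean distance, a naive O(n²) neighborhood query, and all names are assumptions, not the original formulation's code):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point,
    or -1 for noise."""
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    neighbors = lambda i: [j for j, q in enumerate(points)
                           if dist(points[i], q) <= eps]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point: noise, for now
            labels[i] = -1
            continue
        cluster += 1               # i is a core point: start a cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:    # border point previously marked noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts: # j is also a core point: expand
                queue.extend(nj)
    return labels
```

Points first marked noise may later be re-labeled as border points of a cluster, matching the core/border/outlier picture above.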
Challenges for DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies
Biclustering
• Clustering both objects and attributes simultaneously
• Four requirements
  – Only a small set of objects in a cluster (bicluster)
  – A bicluster only involves a small number of attributes
  – An object may participate in multiple biclusters, or in no bicluster
  – An attribute may be involved in multiple biclusters, or in no bicluster
Application Examples
• Recommender systems
  – Objects: users; attributes: items; values: user ratings
• Microarray data
  – Objects: genes; attributes: samples; values: expression levels
[Figure: an n × m gene × sample/condition matrix W = (w_ij)]
Biclusters with Constant Values
        b6   b12  b36  b99
a1      60   60   60   60
a33     60   60   60   60
a86     60   60   60   60
(other rows and columns elided)

Figure 11.5: A gene-condition matrix, a submatrix, and a bi-cluster.
subset of products. For example, AllElectronics is highly interested in finding a group of customers who all like the same group of products. Such a cluster is a submatrix in the customer-product matrix, where all elements have a high value. Using such a cluster, AllElectronics can make recommendations in two directions. First, the company can recommend products to new customers who are similar to the customers in the cluster. Second, the company can recommend to customers new products that are similar to those involved in the cluster.

As with bi-clusters in a gene expression data matrix, the bi-clusters in a customer-product matrix usually have the following characteristics:

• Only a small set of customers participate in a cluster;

• A cluster involves only a small subset of products;

• A customer can participate in multiple clusters, or may not participate in any cluster at all; and

• A product may be involved in multiple clusters, or may not be involved in any cluster at all.

Bi-clustering can be applied to customer-product matrices to mine clusters satisfying the above requirements.

Types of Bi-clusters

“How can we model bi-clusters and mine them?” Let’s start with some basic notation. For the sake of simplicity, we’ll use “genes” and “conditions” to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace “genes” and “conditions” by “customers” and “products” to tackle the customer-product bi-clustering problem.

Let A = {a1, . . . , an} be a set of genes and B = {b1, . . . , bm} be a set of conditions. Let E = [eij] be a gene expression data matrix, that is, a gene-condition matrix, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. A submatrix I × J is
10 10 10 10 10
20 20 20 20 20
50 50 50 50 50
 0  0  0  0  0

Figure 11.6: A bi-cluster with constant values on rows.
10 50 30  70 20
20 60 40  80 30
50 90 70 110 60
 0 40 20  60 10

Figure 11.7: A bi-cluster with coherent values.
defined by a subset I ⊆ A of genes and a subset J ⊆ B of conditions. For example, in the matrix shown in Figure 11.5, {a1, a33, a86} × {b6, b12, b36, b99} is a submatrix.

A bi-cluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of bi-clusters based on such patterns:

• As the simplest case, a submatrix I × J (I ⊆ A, J ⊆ B) is a bi-cluster with constant values if for any i ∈ I and j ∈ J, eij = c, where c is a constant. For example, the submatrix {a1, a33, a86} × {b6, b12, b36, b99} in Figure 11.5 is a bi-cluster with constant values.

• A bi-cluster is interesting if each row has a constant value, though different rows may have different values. A bi-cluster with constant values on rows is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + αi, where αi is the adjustment for row i. For example, Figure 11.6 shows a bi-cluster with constant values on rows.

Symmetrically, a bi-cluster with constant values on columns is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + βj, where βj is the adjustment for column j.

• More generally, a bi-cluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bi-cluster with coherent values (also known as a pattern-based cluster) is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + αi + βj, where αi and βj are the adjustments for row i and column j, respectively. For example, Figure 11.7 shows a bi-cluster with coherent values.

It can be shown that I × J is a bi-cluster with coherent values if and only if for any i1, i2 ∈ I and j1, j2 ∈ J, ei1j1 − ei2j1 = ei1j2 − ei2j2. Moreover, instead of using addition, we can define bi-clusters with coherent
On rows
Biclusters with Coherent Values
• Also known as pattern-based clusters
Biclusters with Coherent Evolutions
• Only up- or down-regulated changes over rows or columns
10  50 30   70 20
20 100 50 1000 30
50 100 90  120 80
 0  80 20  100 10

Figure 11.8: A bi-cluster with coherent evolutions on rows.
values using multiplication, that is, eij = c · αi · βj. Clearly, bi-clusters with constant values on rows or columns are special cases of bi-clusters with coherent values.

• In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bi-cluster with coherent evolutions on rows is a submatrix I × J such that for any i1, i2 ∈ I and j1, j2 ∈ J, (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0. For example, Figure 11.8 shows a bi-cluster with coherent evolutions on rows. Symmetrically, we can define bi-clusters with coherent evolutions on columns.
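The two conditions — the 2×2 difference test for coherent values and the sign test for coherent evolutions on rows — translate directly into code. A brute-force sketch (function names are illustrative) that can be checked against Figures 11.7 and 11.8:

```python
def is_coherent_values(M):
    """eij = c + alpha_i + beta_j holds iff every 2x2 submatrix
    satisfies e[i1][j1] - e[i2][j1] == e[i1][j2] - e[i2][j2]."""
    rows, cols = len(M), len(M[0])
    return all(M[i1][j1] - M[i2][j1] == M[i1][j2] - M[i2][j2]
               for i1 in range(rows) for i2 in range(rows)
               for j1 in range(cols) for j2 in range(cols))

def is_coherent_evolution_rows(M):
    """Rows rise and fall together: for every pair of rows and pair of
    columns, the product of the column differences is non-negative."""
    rows, cols = len(M), len(M[0])
    return all((M[i1][j1] - M[i1][j2]) * (M[i2][j1] - M[i2][j2]) >= 0
               for i1 in range(rows) for i2 in range(rows)
               for j1 in range(cols) for j2 in range(cols))
```

The matrix of Figure 11.7 passes both tests; the matrix of Figure 11.8 passes only the evolution test, since its rows preserve order but not differences.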
Next, we study how to mine bi-clusters.
Bi-clustering Methods
The above specification of the types of bi-clusters only considers ideal cases. In real data sets, such perfect bi-clusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of eij and thus prevent a bi-cluster in nature from appearing in a perfect shape.

There are two major types of methods for discovering bi-clusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bi-cluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find locally optimal bi-clusters. Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined, and then try to enumerate all submatrices of bi-clusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.
Optimization Using the δ-Cluster Algorithm
For a submatrix I × J, the mean of the i-th row is

    e_iJ = (1 / |J|) Σ_{j∈J} e_ij.    (11.16)
Coherent evolutions on rows
Differences from Subspace Clustering
• Subspace clustering uses a global distance/similarity measure
• Pattern-based clustering looks at patterns
• A subspace cluster according to a globally defined similarity measure may not follow the same pattern
Objects Follow the Same Pattern?
[Figure: two objects (blue and green) plotted on attributes D1 and D2, illustrating pScore]
The smaller the pScore, the more consistently the two objects follow the same pattern.
Pattern-based Clusters
• pScore: the similarity between two objects rx, ry on two attributes au, av:

    pScore([[rx.au, rx.av], [ry.au, ry.av]]) = |(rx.au − ry.au) − (rx.av − ry.av)|

• δ-pCluster (R, D): for any objects rx, ry ∈ R and any attributes au, av ∈ D,

    pScore([[rx.au, rx.av], [ry.au, ry.av]]) ≤ δ   (δ ≥ 0)
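The pScore and the δ-pCluster check are straightforward to implement; a sketch representing objects as attribute-to-value dicts (the representation and names are illustrative assumptions):

```python
def p_score(rx, ry, au, av):
    """pScore of objects rx, ry on attributes au, av:
    |(rx[au] - ry[au]) - (rx[av] - ry[av])|."""
    return abs((rx[au] - ry[au]) - (rx[av] - ry[av]))

def is_delta_pcluster(objects, attrs, delta):
    """(R, D) is a delta-pCluster if every pair of objects, on every
    pair of attributes, has pScore <= delta."""
    return all(p_score(rx, ry, au, av) <= delta
               for rx in objects for ry in objects
               for au in attrs for av in attrs)
```

A pScore of 0 means the two objects shift by exactly the same amount between the two attributes.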
Maximal pCluster
• If (R, D) is a δ-pCluster, then every sub-cluster (R′, D′) with R′ ⊆ R and D′ ⊆ D is also a δ-pCluster
  – An anti-monotonic property
  – A large pCluster is accompanied by many small pClusters, so enumerating them all is inefficient
• Idea: mine only the maximal pClusters
  – A δ-pCluster is maximal if no proper super-cluster of it is a δ-pCluster
Mining Maximal pClusters
• Given
  – A cluster threshold δ
  – An attribute threshold min_a
  – An object threshold min_o
• Task: mine the complete set of significant maximal δ-pClusters
  – A significant δ-pCluster has at least min_o objects on at least min_a attributes
Grid-based Clustering Methods
• Ideas
  – Use multi-resolution grid data structures
  – Use dense grid cells to form clusters
• Several interesting methods
  – CLIQUE, STING, WaveCluster
CLIQUE
• CLIQUE: Clustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in the unit exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: the Method
• Partition the data space and find the number of points in each cell of the partition
  – Apriori: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters
  – Determine dense units and connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine the minimal cover for each cluster
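The first step — counting points per grid cell and pruning 2-D candidates with the Apriori property — can be sketched as follows. This is an illustrative fragment, not the CLIQUE implementation: the uniform grid over [0, span) per dimension and all names are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, span, threshold):
    """CLIQUE-style sketch: dense 1-D intervals per dimension, then
    2-D candidates restricted by Apriori — a 2-D cell can only be
    dense if both of its 1-D projections are dense."""
    width = span / n_intervals
    cell = lambda v: min(int(v // width), n_intervals - 1)
    dims = range(len(points[0]))
    # dense 1-D units: intervals holding more than `threshold` points
    dense1 = {d: {u for u, c in Counter(cell(p[d]) for p in points).items()
                  if c > threshold}
              for d in dims}
    # candidate 2-D units: both 1-D projections must already be dense
    dense2 = {}
    for d1, d2 in combinations(dims, 2):
        counts = Counter((cell(p[d1]), cell(p[d2])) for p in points
                         if cell(p[d1]) in dense1[d1]
                         and cell(p[d2]) in dense1[d2])
        dense2[(d1, d2)] = {u for u, c in counts.items() if c > threshold}
    return dense1, dense2
```

The same pruning generalizes from 2-D candidates to k-d candidates built from dense (k−1)-d units.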
CLIQUE: An Example
[Figure: dense units in the (age, salary) and (age, vacation) subspaces — salary (×$10,000) and vacation (weeks) plotted against age (20–60) — intersected to identify a candidate cluster in the (age, salary, vacation) space]
CLIQUE: Pros and Cons
• Automatically finds subspaces of the highest dimensionality with high-density clusters
• Insensitive to the order of input
  – Does not presume any canonical data distribution
• Scales linearly with the size of input
• Scales well with the number of dimensions
• The quality of the clustering result may be degraded as the price of the method’s simplicity
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Fuzzy Clustering
• Each point x_i takes a probability w_ij to belong to cluster C_j
• Requirements
  – For each point x_i: Σ_{j=1}^k w_ij = 1
  – For each cluster C_j: 0 < Σ_{i=1}^m w_ij < m
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij
Repeat
    Compute the centroid of each cluster using the fuzzy pseudo-partition
    Recompute the fuzzy pseudo-partition, i.e., the w_ij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimize the sum of the squared error (SSE):

    SSE(C_1, …, C_k) = Σ_{j=1}^k Σ_{i=1}^m w_ij^p dist(x_i, c_j)²

• Computing centroids:

    c_j = Σ_{i=1}^m w_ij^p x_i / Σ_{i=1}^m w_ij^p

• Updating the fuzzy pseudo-partition:

    w_ij = (1 / dist(x_i, c_j)²)^{1/(p−1)} / Σ_{q=1}^k (1 / dist(x_i, c_q)²)^{1/(p−1)}

  – When p = 2:

    w_ij = (1 / dist(x_i, c_j)²) / Σ_{q=1}^k (1 / dist(x_i, c_q)²)
Choice of P
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Is a Clustering Good?
• Feasibility
  – Applying any clustering method to a uniformly distributed data set is meaningless
• Quality
  – Do the clustering results meet the users’ interest?
  – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
  – Clustering patients into clusters corresponding to male or female is not meaningful
Major Tasks
• Assessing clustering tendency
  – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Figure 10.21: A data set that is uniformly distributed in the data space.
• Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

In the rest of this section, we discuss each of the above three topics.

10.6.1 Assessing Clustering Tendency

Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters. Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points in a data space. Even though a clustering algorithm may return clusters for the data, those clusters are random and are not meaningful.

Example 10.9 Clustering requires non-uniform distribution of data. Figure 10.21 shows a data set that is uniformly distributed in 2-dimensional data space. Although a clustering algorithm may still artificially partition the points into groups, the groups will unlikely mean anything significant to the application due to the uniform distribution of the data.

“How can we assess the clustering tendency of a data set?” Intuitively, we can try to measure the probability that the data set is generated by a uniform data distribution. This can be achieved using statistical tests for spatial randomness. To illustrate this idea, let’s look at a simple yet effective statistic called the Hopkins Statistic.

The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. Given a data set, D, which is regarded as a sample of a random variable, o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins Statistic as follows:
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
x_i = min_{v ∈ D} {dist(p_i, v)}
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D
• For each qi, find the nearest neighbor of qi in D – {qi}; let yi be the distance between qi and its nearest neighbor in D – {qi}
• Calculate the Hopkins Statistic H
y_i = min_{v ∈ D, v ≠ q_i} {dist(q_i, v)}

H = Σ_{i=1}^n y_i / (Σ_{i=1}^n x_i + Σ_{i=1}^n y_i)
Explanation
• If D is uniformly distributed, then Σ_{i=1}^n y_i and Σ_{i=1}^n x_i would be close to each other, and thus H would be around 0.5
• If D is skewed, then Σ_{i=1}^n y_i would be substantially smaller than Σ_{i=1}^n x_i, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
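The two sampling steps defining H can be sketched in plain Python (a minimal illustration for 2-d points; the function name `hopkins` and the toy data are illustrative assumptions, not from the slides):

```python
import math
import random

def hopkins(D, n, seed=0):
    """Estimate the Hopkins Statistic of a 2-d data set D (list of (x, y) tuples).

    n points p_i are sampled uniformly from the bounding box of D, and n points
    q_i are sampled from D itself; x_i and y_i are the respective nearest-
    neighbour distances, and H = sum(y) / (sum(x) + sum(y)).
    """
    rng = random.Random(seed)
    xs = [p[0] for p in D]
    ys = [p[1] for p in D]

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # x_i: distance from a uniformly sampled point to its nearest neighbour in D
    x_sum = 0.0
    for _ in range(n):
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        x_sum += min(dist(p, v) for v in D)

    # y_i: distance from a sampled data point to its nearest other point in D
    y_sum = 0.0
    for q in rng.sample(D, n):
        y_sum += min(dist(q, v) for v in D if v != q)

    return y_sum / (x_sum + y_sum)
```

On strongly clustered data H comes out near 0, and on uniform data near 0.5, matching the interpretation above.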
Finding the Number of Clusters
• Depending on many factors
  – The shape and scale of the distribution in the data set
  – The clustering resolution required by the user
• Many methods exist
  – Set k = √(n/2), so that each cluster has √(2n) points on average
  – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point
A Cross-Validation Method
• Divide the data set D into m parts
• Use m – 1 parts to find a clustering
• Use the remaining part as the test set to test the quality of the clustering
  – For each point in the test set, find the closest centroid or cluster center
  – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, and use the average as the quality measure
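The m-fold procedure above can be sketched with a toy 1-d k-means (Lloyd's algorithm with a deterministic initialization; all names and data here are illustrative assumptions, not part of the slides):

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 1-d points; returns k centroids.
    Initial centroids are evenly spaced quantiles, so runs are deterministic."""
    srt = sorted(points)
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            buckets[nearest].append(p)
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def cv_score(points, k, m=5):
    """m-fold cross-validation score for a candidate k: average total squared
    distance from held-out points to their closest training centroid
    (smaller means the clustering model fits held-out data better)."""
    folds = [points[i::m] for i in range(m)]
    total = 0.0
    for i in range(m):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centroids = kmeans(train, k)
        total += sum(min((p - c) ** 2 for c in centroids) for p in folds[i])
    return total / m
```

On data with two well-separated groups, `cv_score(points, 2)` comes out far below `cv_score(points, 1)`, which is how the right k is recognized.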
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations – There is a known ground truth – the extrinsic
(supervised) methods, comparing the clustering against the ground truth
– The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Quality in Extrinsic Methods
• Cluster homogeneity: the purer the clusters in a clustering, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
BCubed Precision and Recall

• D = {o1, …, on}
  – L(oi) is the category of oi given by the ground truth
• C is a clustering on D
  – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
BCubed Precision and Recall
• Precision
• Recall
one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

• Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.

BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.

Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by
Correctness(o_i, o_j) = 1 if L(o_i) = L(o_j) ⇔ C(o_i) = C(o_j), and 0 otherwise.   (10.28)

BCubed precision is defined as

Precision BCubed = (1/n) Σ_{i=1}^n [ Σ_{o_j: i ≠ j, C(o_i) = C(o_j)} Correctness(o_i, o_j) / ‖{o_j | i ≠ j, C(o_i) = C(o_j)}‖ ].   (10.29)
BCubed recall is defined as

Recall BCubed = (1/n) Σ_{i=1}^n [ Σ_{o_j: i ≠ j, L(o_i) = L(o_j)} Correctness(o_i, o_j) / ‖{o_j | i ≠ j, L(o_i) = L(o_j)}‖ ].   (10.30)
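The two formulas above transcribe almost directly into code (a pure-Python sketch; the helper name `bcubed` is illustrative):

```python
def bcubed(L, C):
    """BCubed precision and recall of a clustering.

    L[i] is the ground-truth category of object i, C[i] its cluster id.
    Correctness(i, j) = 1 iff (L[i] == L[j]) <=> (C[i] == C[j]).
    """
    n = len(L)

    def correctness(i, j):
        return 1.0 if (L[i] == L[j]) == (C[i] == C[j]) else 0.0

    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and C[i] == C[j]]
        same_category = [j for j in range(n) if j != i and L[i] == L[j]]
        if same_cluster:
            precision += sum(correctness(i, j) for j in same_cluster) / len(same_cluster)
        if same_category:
            recall += sum(correctness(i, j) for j in same_category) / len(same_category)
    return precision / n, recall / n
```

A perfect clustering scores (1, 1); merging all objects into one cluster keeps recall at 1 but drives precision down, illustrating the trade-off.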
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
a(o) = Σ_{o′ ∈ Ci, o′ ≠ o} dist(o, o′) / (|Ci| − 1)   (10.31)

and

b(o) = min_{Cj: 1 ≤ j ≤ k, j ≠ i} { Σ_{o′ ∈ Cj} dist(o, o′) / |Cj| }.   (10.32)

The silhouette coefficient of o is then defined as

s(o) = (b(o) − a(o)) / max{a(o), b(o)}.   (10.33)
The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case, and should be avoided.

To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures
Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
• For each object o,
  – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster; the smaller, the better
  – Calculate b(o), the minimum average distance from o to the objects in each cluster that o does not belong to – degree of separation from other clusters; the larger, the better
Silhouette Coefficient
• Then
• Use the average silhouette coefficient of all objects as the overall measure
a(o) = Σ_{o′ ∈ Ci, o′ ≠ o} dist(o, o′) / (|Ci| − 1)

b(o) = min_{Cj: o ∉ Cj} { Σ_{o′ ∈ Cj} dist(o, o′) / |Cj| }

s(o) = (b(o) − a(o)) / max{a(o), b(o)}
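The three formulas above transcribe directly (a sketch for 1-d points with dist(o, o′) = |o − o′|; it assumes at least two clusters, each with at least two members):

```python
def silhouette(clusters):
    """Average silhouette coefficient; clusters is a list of lists of
    1-d points."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for oi, o in enumerate(cluster):
            # a(o): average distance from o to the other members of its cluster
            a = (sum(abs(o - p) for k, p in enumerate(cluster) if k != oi)
                 / (len(cluster) - 1))
            # b(o): minimum average distance from o to each other cluster
            b = min(sum(abs(o - p) for p in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

For two tight, well-separated clusters the average silhouette is close to 1, the preferable case described above.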
Classification
Jian Pei: CMPT 741/459 Classification (1) 293
Classification and Prediction
• Classification: predict categorical class labels
  – Build a model for a set of classes/concepts
  – E.g., classify loan applications (approve/decline)
• Prediction: model continuous-valued functions
  – E.g., predict the economic growth in 2015
Classification: A 2-step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – The model is expressed as classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate the accuracy of the model using an independent test set
  – Acceptable accuracy → apply the model to classify tuples with unknown class labels
Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training data:

Name | Rank       | Years | Tenured
Mike | Ass. Prof  | 3     | No
Mary | Ass. Prof  | 7     | Yes
Bill | Prof       | 2     | Yes
Jim  | Asso. Prof | 7     | Yes
Dave | Ass. Prof  | 6     | No
Anne | Asso. Prof | 3     | No

Learned classifier (model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Application
Classifier → Testing Data → Unseen Data

Testing data:

Name    | Rank       | Years | Tenured
Tom     | Ass. Prof  | 2     | No
Merlisa | Asso. Prof | 7     | No
George  | Prof       | 5     | Yes
Joseph  | Ass. Prof  | 7     | Yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised/Unsupervised Learning
• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data
Measurements of Quality
• Prediction accuracy
• Speed and scalability
  – Construction speed and application speed
• Robustness: handling noise and missing values
• Scalability: building the model for large training data sets
• Interpretability: understandability of models
Decision Tree Induction
• Decision tree representation • Construction of a decision tree • Inductive bias and overfitting • Scalable enhancements for large databases
Decision Tree
• A node in the tree: a test of some attribute
• A branch: a possible value of the attribute
• Classification
  – Start at the root
  – Test the attribute
  – Move down the tree branch

Example tree for PlayTennis:

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
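The root-to-leaf classification walk can be sketched with nested dicts (the representation is an illustrative assumption, not the slides' notation):

```python
# The PlayTennis tree as nested dicts: an internal node maps its test
# attribute to branches, and a leaf is the class label "Yes"/"No".
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, example):
    """Walk from the root, testing one attribute per level, until a leaf."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node
```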
Training Dataset

Outlook  | Temp | Humid  | Wind   | PlayTennis
Sunny    | Hot  | High   | Weak   | No
Sunny    | Hot  | High   | Strong | No
Overcast | Hot  | High   | Weak   | Yes
Rain     | Mild | High   | Weak   | Yes
Rain     | Cool | Normal | Weak   | Yes
Rain     | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny    | Mild | High   | Weak   | No
Sunny    | Cool | Normal | Weak   | Yes
Rain     | Mild | Normal | Weak   | Yes
Sunny    | Mild | Normal | Strong | Yes
Overcast | Mild | High   | Strong | Yes
Overcast | Hot  | Normal | Weak   | Yes
Rain     | Mild | High   | Strong | No
Appropriate Problems
• Instances are represented by attribute-value pairs
  – Extensions of decision trees can handle real-valued attributes
• Disjunctive descriptions may be required
• The training data may contain errors or missing values
Basic Algorithm ID3
• Construct the tree in a top-down, recursive, divide-and-conquer manner
  – Which attribute is the best at the current node?
  – Create a node for each possible attribute value
  – Partition the training data into descendant nodes
• Conditions for stopping recursion
  – All samples at a given node belong to the same class
  – No attributes remain for further partitioning
    • Majority voting is employed for classifying the leaf
  – There are no samples at the node
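A compact sketch of the recursion above, using information gain (defined on the following slides) as the attribute selector and majority vote at exhausted leaves; the helper names and the tiny test data are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    """Expected entropy reduction from partitioning on attribute a."""
    total = entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
        total -= len(sub) / len(labels) * entropy(sub)
    return total

def id3(rows, labels, attributes):
    """Top-down recursive divide-and-conquer tree construction."""
    if len(set(labels)) == 1:            # all samples in one class
        return labels[0]
    if not attributes:                   # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    tree = {best: {}}
    # one branch per value observed at this node, so no empty branches arise
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree
```

On a toy set where attribute X perfectly predicts the label and Y is noise, the sketch picks X at the root and produces pure leaves.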
Which Attribute Is the Best?
• The attribute most useful for classifying examples
• Information gain and Gini index
  – Statistical properties
  – Measure how well an attribute separates the training examples
Entropy
• Measures the homogeneity of examples

  Entropy(S) ≡ −Σ_{i=1}^c p_i log_2 p_i

  – S is the training data set, and p_i is the proportion of S belonging to class i
• The smaller the entropy, the purer the data set
Information Gain
• The expected reduction in entropy caused by partitioning the examples according to an attribute

  Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

  where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v
Example

Using the PlayTennis training data above (9 Yes, 5 No):

Entropy(S) = −(9/14) log_2(9/14) − (5/14) log_2(5/14) = 0.94

Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
             = 0.94 − (8/14) × Entropy(S_Weak) − (6/14) × Entropy(S_Strong)
             = 0.94 − (8/14) × 0.811 − (6/14) × 1.00
             = 0.048
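The arithmetic in this example can be checked directly (a short sketch; the helper `entropy2` takes two-class Yes/No counts and is my own naming):

```python
import math

def entropy2(pos, neg):
    """Entropy of a two-class sample given its Yes/No counts."""
    e, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

e_s = entropy2(9, 5)   # Entropy(S) for 9 Yes / 5 No, approx. 0.94
# Wind splits S into Weak (6 Yes, 2 No) and Strong (3 Yes, 3 No)
g_wind = e_s - 8 / 14 * entropy2(6, 2) - 6 / 14 * entropy2(3, 3)
```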
Hypothesis Space Search in Decision Tree Building • Hypothesis space: the set of possible
decision trees • ID3: simple-to-complex, hill-climbing search
– Evaluation function: information gain
Capabilities and Limitations
• The hypothesis space is complete
• Maintains only a single current hypothesis
• No backtracking
  – May converge to a locally optimal solution
• Uses all training examples at each step
  – Makes statistics-based decisions
  – Not sensitive to errors in individual examples
Natural Bias
• The information gain measure favors attributes with many values
• An extreme example
  – Attribute “date” may have the highest information gain
  – It yields a very broad decision tree of depth one
  – Such a tree is inapplicable to any future data
Alternative Measures
• Gain ratio: penalize attributes like date by incorporating split information

  SplitInformation(S, A) ≡ −Σ_{i=1}^c (|S_i| / |S|) log_2(|S_i| / |S|)

• Split information is sensitive to how broadly and uniformly the attribute splits the data

  GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

• Gain ratio can be undefined or very large
  – Only test attributes with above-average gain
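The two formulas above, plus the date-like extreme case, can be sketched directly (function names are mine):

```python
import math

def split_information(sizes):
    """SplitInformation(S, A) from the sizes |S_i| of the partition of S
    induced by attribute A."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(sizes)
```

For Wind (subset sizes 8 and 6), SplitInformation is about 0.985, while a date-like attribute splitting 14 examples into 14 singletons has SplitInformation = log2(14), about 3.81, which sharply shrinks its gain ratio.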
Measuring Inequality
Lorenz Curve
  X-axis: quintiles
  Y-axis: cumulative share of income earned up to the plotted quintile
  Gap between the actual curve and the line of perfect equality: the degree of inequality

Gini index
  Gini = 0: perfectly even distribution
  Gini = 1: perfectly unequal distribution
  The greater the gap, the more unequal the distribution
Gini Index (Adjusted)
• A data set S contains examples from n classes

  gini(S) = 1 − Σ_{j=1}^n p_j^2

  – p_j is the relative frequency of class j in S
• A data set S is split into two subsets S1 and S2 with sizes N1 and N2 respectively

  gini_split(S) = (N1 / N) gini(S1) + (N2 / N) gini(S2)

• The attribute providing the smallest gini_split(S) is chosen to split the node
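The two definitions transcribe directly (a sketch; per-class counts stand in for the relative frequencies p_j):

```python
def gini(counts):
    """Gini index of a data set from its per-class counts (p_j = c / n)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini of a binary split into subsets of sizes N1 and N2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)
```

On the PlayTennis counts, gini([9, 5]) is about 0.459, a pure node scores 0, and the Weak/Strong split on Wind scores about 0.429, a small improvement, mirroring its small information gain.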
Extracting Classification Rules
• Classification rules can be extracted from a decision tree
• Each path from the root to a leaf → an IF-THEN rule
  – All attribute-value pairs along a path form a conjunctive condition
  – The leaf node holds the class prediction
  – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
• Rules are easy to understand
Inductive Bias
• The set of assumptions that, together with the training data, deductively justifies the classifications assigned to future instances
  – Preferences of the classifier construction
    • Shorter trees are preferred over longer trees
    • Trees that place high-information-gain attributes close to the root are preferred
Why Prefer Short Trees?
• Occam’s razor: prefer the simplest hypothesis that fits the data
  – “One should not increase, beyond what is necessary, the number of entities required to explain anything”
  – Also known as the principle of parsimony
• There are fewer short trees than long trees
• A short tree is less likely to be a statistical coincidence
Overfitting
• A decision tree T may overfit the training data
  – if there exists an alternative tree T’ such that T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
• Why overfitting?
  – Noisy data
  – Bias in the training data
Jian Pei: CMPT 741/459 Classification (2) 319
The Evaluation Issues
• The accuracy of a classifier can be evaluated using a test data set – The test set is a part of the available labeled
data set • But how can we evaluate the accuracy of a
classification method? – A classification method can generate many
classifiers • What if the available labeled data set is too
small?
Holdout Method
• Partition the available labeled data set into two disjoint subsets: the training set and the test set – 50-50 – 2/3 for training and 1/3 for testing
• Build a classifier using the training set • Evaluate the accuracy using the test set
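The holdout split above can be sketched in a few lines of Python (a minimal illustration on a toy labeled data set; `holdout_split` is a hypothetical helper, not from the slides):

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Randomly partition labeled data into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = [(x, x % 2) for x in range(30)]   # toy (feature, label) pairs
train, test = holdout_split(data)        # 2/3 for training, 1/3 for testing
```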
Limitations of Holdout Method
• Fewer labeled examples for training • The classifier depends heavily on the
composition of the training and test sets – The smaller the training set, the larger the
variance • If the test set is too small, the evaluation is
not reliable • The training and test sets are not
independent
Cross-Validation
• Each record is used the same number of times for training and exactly once for testing
• K-fold cross-validation – Partition the data into k equal-sized subsets – In each round, use one subset as the test set, and use
the remaining k − 1 subsets together as the training set – Repeat k times – The total error is the sum of the errors over the k rounds
• Leave-one-out: k = n – Utilize as much data as possible for training – Computationally expensive
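The k-fold procedure can be sketched as follows (a pure-Python sketch; the round-robin fold assignment is one simple choice among many):

```python
def kfold_splits(n, k):
    """Round-robin partition of indices 0..n-1 into k folds;
    each round uses one fold for testing and the other k-1 for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, folds[i]))
    return splits

splits = kfold_splits(10, 5)   # 5-fold cross-validation over 10 records
```

Every index appears in exactly one test fold, matching the slide's "exactly once for testing" property.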
Accuracy Can Be Misleading …
• Consider a data set of 99% negative-class and 1% positive-class examples
• A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
• Imbalanced class distributions are common in many applications – Medical applications, fraud detection, …
Performance Evaluation Matrix
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Confusion matrix (contingency table, error matrix): useful when the class distribution is imbalanced
Performance Evaluation Matrix
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

True positive rate (TPR, sensitivity) = TP / (TP + FN)
True negative rate (TNR, specificity) = TN / (TN + FP)
False positive rate (FPR) = FP / (TN + FP)
False negative rate (FNR) = FN / (TP + FN)
Recall and Precision
• Target class is more important than the other classes
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

Precision p = TP / (TP + FP)
Recall r = TP / (TP + FN)
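All of the rates on these slides follow directly from the four confusion-matrix counts; a small sketch (the counts below are hypothetical):

```python
def rates(tp, fn, fp, tn):
    """Derive the slide's evaluation measures from confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
        "tpr":       tp / (tp + fn),   # sensitivity / recall
        "tnr":       tn / (tn + fp),   # specificity
        "fpr":       fp / (tn + fp),
        "fnr":       fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

m = rates(tp=40, fn=10, fp=20, tn=930)   # imbalanced toy counts
```

Note that accuracy is 0.97 here even though recall is only 0.80, illustrating the "accuracy can be misleading" slide above.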
Fallout
• Type I errors – false positive: a negative object is classified as positive – Fallout: the type I error rate, FP / (FP + TN)
• Type II errors – false negative: a positive object is classified as negative – Captured by recall
Fβ Measure
• How can we summarize precision and recall into one metric? – Using a (weighted) harmonic mean of the two
• Fβ measure
– β = 0: Fβ is the precision – β → ∞: Fβ approaches the recall – 0 < β < ∞: Fβ trades off the precision and the recall

F-measure (F1): F = 2rp / (r + p) = 2TP / (2TP + FP + FN)

Fβ = (β² + 1)rp / (r + β²p) = (β² + 1)TP / ((β² + 1)TP + β²FN + FP)
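A direct transcription of the Fβ formula, checking the limiting cases noted above (the precision and recall values are made up):

```python
def f_beta(p, r, beta):
    """F_beta = (beta^2 + 1) * r * p / (r + beta^2 * p)."""
    return (beta**2 + 1) * r * p / (r + beta**2 * p)

p, r = 0.5, 0.8
f1 = f_beta(p, r, 1.0)   # harmonic mean of precision and recall
```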
Weighted Accuracy
• A more general metric
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

Measure     w1      w2    w3    w4
Recall      1       1     0     0
Precision   1       0     1     0
Fβ          β²+1    β²    1     0
Accuracy    1       1     1     1
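The weight table can be verified mechanically (toy counts; w = (w1, w2, w3, w4) as in the table):

```python
def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d), with a=TP, b=FN, c=FP, d=TN."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

a, b, c, d = 40, 10, 20, 930
recall    = weighted_accuracy(a, b, c, d, 1, 1, 0, 0)
precision = weighted_accuracy(a, b, c, d, 1, 0, 1, 0)
f1        = weighted_accuracy(a, b, c, d, 2, 1, 1, 0)   # beta = 1: weights (2, 1, 1, 0)
accuracy  = weighted_accuracy(a, b, c, d, 1, 1, 1, 1)
```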
ROC Curve
• Receiver Operating Characteristic (ROC): consider a 1-dimensional data set containing 2 classes, where any point located at x > t is classified as positive
ROC curve, points written as (TPR, FPR):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal
• Diagonal line: random guessing
• Below the diagonal line: the prediction is the opposite of the true class
Figure from [Tan, Steinbach, Kumar]
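A ROC curve can be traced by sweeping the threshold t over a classifier's scores (the scores and labels below are hypothetical):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, plus the (0, 0) endpoint."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
pts = roc_points(scores, labels)   # ends at (1, 1): everything declared positive
```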
Comparing Two Classifiers
Figure from [Tan, Steinbach, Kumar]
Cost-Sensitive Learning
• In some applications, misclassifying some classes may be disastrous – Tumor detection, fraud detection
• Using a cost matrix

                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   −1           100
CLASS     Class=No    1            0
Sampling for Imbalanced Classes
• Consider a data set containing 100 positive examples and 1,000 negative examples
• Undersampling: use a random sample of 100 negative examples and all positive examples – Some useful negative examples may be lost – Run undersampling multiple times, use the ensemble of
multiple base classifiers – Focused undersampling: remove negative samples that
are not useful for classification, e.g., those far away from the decision boundary
Oversampling
• Replicate the positive examples until the training set has an equal number of positive and negative examples
• For noisy data, may cause overfitting
Errors in Classification
• Bias: the difference between the real class boundary and the decision boundary of a classification model
• Variance: variability in the training data set • Intrinsic noise in the target class: the target
class can be non-deterministic – instances with the same attribute values can have different class labels
One or More?
• What if a medical doctor is not sure about a case? – Joint-diagnosis: using a group of doctors carrying
different expertise – The wisdom of the crowd is often more accurate
• All eager learning methods make prediction using a single classifier induced from training data – A single classifier may have low confidence in some
cases • Ensemble methods: construct a set of base
classifiers and take a vote on predictions in classification
Ensemble Classifiers

Step 1: Create multiple data sets D1, D2, …, Dt from the original training data D
Step 2: Build a base classifier Ci on each Di
Step 3: Combine the classifiers: C*(x) = Vote(C1(x), …, Ct(x))

Figure from [Tan, Steinbach, Kumar]
Why May Ensemble Method Work?
• Suppose there are two classes and each base classifier has an error rate of 35%
• What if we use 25 base classifiers? – If all base classifiers are identical, the ensemble
error rate is still 35% – If base classifiers are independent, the
ensemble makes a wrong prediction only if more than half of the base classifiers are wrong
P(ensemble is wrong) = Σ_{i=13}^{25} C(25, i) · 0.35^i · 0.65^{25−i} ≈ 0.06
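The 0.06 figure can be reproduced with a one-line binomial sum (25 independent base classifiers, each with error rate 0.35; the ensemble errs only when at least 13 of them are wrong):

```python
from math import comb

# Probability that a majority (>= 13) of 25 independent base classifiers,
# each with error rate 0.35, are wrong at the same time
e_ensemble = sum(comb(25, i) * 0.35**i * 0.65**(25 - i) for i in range(13, 26))
```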
Ensemble Error Rate
Figure from [Tan, Steinbach, Kumar]
Ensemble Classifiers – When?
• The base classifiers should be independent of each other
• Each base classifier should do better than a classifier that performs random guessing
How to Construct Ensemble?
• Manipulating the training set: derive multiple training sets and build a base classifier on each
• Manipulating the input features: use only a subset of features in a base classifier
• Manipulating the class labels: if there are many classes, randomly partition the classes into two subsets A and B in each base classifier; for a test case, if the base classifier predicts the superclass A, every class in A receives a vote
• Manipulating the learning algorithm, e.g., using different network configurations in an ANN
Bootstrap
• Given an original training set T, derive a training set T’ by repeatedly sampling from T uniformly with replacement
• If T has n tuples, each tuple has a probability p = 1 − (1 − 1/n)^n of being selected into T’ – When n → ∞, p → 1 − 1/e ≈ 0.632
• Use the tuples not in T’ as the test set
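The 0.632 limit is easy to check numerically:

```python
import math

def p_selected(n):
    """Probability that a fixed tuple appears at least once in n draws with replacement."""
    return 1 - (1 - 1 / n) ** n

p = p_selected(10**6)   # close to the limit 1 - 1/e
```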
Bootstrap
• Use a bootstrap sample as the training set, use the tuples not in the training set as the test set
• .632 bootstrap: compute the overall accuracy by combining the accuracies of each bootstrap sample with the accuracy computed from a classifier using the whole data set as the training set
acc_bootstrap = (1/k) Σ_{i=1}^{k} (0.632 · acc_i + 0.368 · acc_all)

where acc_i is the accuracy of the i-th bootstrap round and acc_all is the accuracy of a classifier trained on the whole data set
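A sketch of the .632 combination (the per-round accuracies below are hypothetical):

```python
def acc_632(boot_accs, acc_all):
    """.632 bootstrap: average 0.632*acc_i + 0.368*acc_all over the k rounds."""
    return sum(0.632 * acc_i + 0.368 * acc_all for acc_i in boot_accs) / len(boot_accs)

acc = acc_632([0.80, 0.85, 0.90], acc_all=0.95)
```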
Bagging • Run bootstrap k times to obtain k base classifiers • A test instance is assigned to the class that
receives the highest number of votes • Strength: reduce the variance of base classifiers –
good for unstable base classifiers – Unstable classifiers: sensitive to minor perturbations in
the training set, e.g., decision trees, associative classifiers, and ANN
• For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since the training sets are smaller
• Less overfitting on noisy data
Boosting • Assign a weight to each training example
– Initially, each example is assigned a weight 1/n • Weights can be used in one of the following ways
– Weights as a sampling distribution to draw a set of bootstrap samples from the original training set
– Weights used by a base classifier to learn a model biased towards heavier examples
• Adaptively change the weight at the end of each boosting round – The weight of an example correctly classified decreases – The weight of an example incorrectly classified
increases • Each round generates a base classifier
Critical Design Choices in Boosting
• How are the weights of the training examples updated at the end of each boosting round?
• How are the predictions made by the base classifiers combined?
AdaBoost
• Each base classifier carries an importance score related to its error rate – Error rate
– wi: weight, I(p) = 1 if p is true – Importance score
ε_i = (1/N) Σ_{j=1}^{N} w_j · I(C_i(x_j) ≠ y_j)

α_i = (1/2) ln((1 − ε_i) / ε_i)
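A small numeric check of the two formulas (the weights here are the initial uniform 1/n, and the predictions are made up):

```python
import math

def weighted_error(weights, preds, labels):
    """Weighted error rate: total weight of the misclassified examples
    (weights are kept normalized to sum to 1)."""
    return sum(w for w, p, y in zip(weights, preds, labels) if p != y)

def importance(eps):
    """alpha = 0.5 * ln((1 - eps) / eps); large when the error rate is small."""
    return 0.5 * math.log((1 - eps) / eps)

n = 10
w = [1 / n] * n
preds  = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # 2 of 10 examples misclassified
labels = [1] * n
eps = weighted_error(w, preds, labels)
alpha = importance(eps)
```

A classifier no better than a coin flip (ε = 0.5) gets importance 0; the importance grows as ε shrinks.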
How Does Importance Score Work?
Weight Adjustment in AdaBoost
– If any intermediate round generates an error rate of more than 50%, the weights are reset to 1/n
• The ensemble error rate is bounded
w_i^{(j+1)} = (w_i^{(j)} / Z_j) · exp(−α_j)  if C_j(x_i) = y_i
w_i^{(j+1)} = (w_i^{(j)} / Z_j) · exp(+α_j)  if C_j(x_i) ≠ y_i

where Z_j is the normalization factor that makes Σ_i w_i^{(j+1)} = 1

e_ensemble ≤ Π_i √(ε_i (1 − ε_i))
Intuition – Bayesian Classification
• More hockey fans in Canada than in the US – Which country is Tom, a hockey fan, from? – Predicting Canada has a better chance of being right
• Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
• P(hockey fan | Canadian) = 30%: the probability that a Canadian is a hockey fan
• Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada
Bayes Theorem
• Find the maximum a posteriori (MAP) hypothesis
– Require background knowledge – Computational cost
P(h | D) = P(D | h) P(h) / P(D)

h_MAP ≡ argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D) = argmax_{h∈H} P(D | h) P(h)
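Applying the theorem to the hockey-fan example from the previous slide; note that P(hockey fan | non-Canadian) = 0.10 is an assumed number (not given in the slides), needed only to compute P(hockey fan):

```python
p_can = 0.05                 # prior P(Canadian), from the slide
p_fan_given_can = 0.30       # P(hockey fan | Canadian), from the slide
p_fan_given_not = 0.10       # ASSUMED for illustration only

# Total probability of being a hockey fan, then Bayes' theorem
p_fan = p_fan_given_can * p_can + p_fan_given_not * (1 - p_can)
p_can_given_fan = p_fan_given_can * p_can / p_fan
```

Under these numbers the posterior P(Canadian | hockey fan) ≈ 0.136, already nearly three times the 5% prior.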
Naïve Bayes Classifier
• Assumption: attributes are independent • Given a tuple (a1, a2, …, an), predict its
class as

C = argmax_{Ci} P(a1, a2, …, an | Ci) P(Ci) = argmax_{Ci} P(Ci) Π_j P(aj | Ci)

– argmax_x f(x): the value of x that maximizes f(x)
• Example: argmax_{x ∈ {1, 2, −3}} x² = −3
Example: Training Dataset
Data sample X = (Outlook = Sunny, Temp = Mild, Humid = High, Wind = Weak). Will she play tennis? No

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

P(Yes|X) ∝ P(X|Yes) P(Yes) ≈ 0.014
P(No|X) ∝ P(X|No) P(No) ≈ 0.027, so Naïve Bayes predicts No
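As a check, the scores can be computed by counting directly in the table (a sketch; `score` computes the unnormalized product P(C) Π_j P(aj | C)). Working through the counts gives P(Yes|X) ∝ 0.014 and P(No|X) ∝ 0.027, so the higher score is No:

```python
rows = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]

def score(x, cls):
    """Unnormalized posterior: P(cls) * prod_j P(x_j | cls), estimated by counting."""
    in_cls = [r for r in rows if r[-1] == cls]
    s = len(in_cls) / len(rows)                  # prior P(cls)
    for j, v in enumerate(x):
        s *= sum(1 for r in in_cls if r[j] == v) / len(in_cls)
    return s

x = ("Sunny", "Mild", "High", "Weak")
p_yes, p_no = score(x, "Yes"), score(x, "No")    # No wins
```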
Probability of Infrequent Values
• What about X = (Outlook = Sunny, Temp = Hot, Humid = Low, Wind = Weak)?
• P(Humid = Low) = 0, so the estimated score of every class is 0
Smoothing
• Suppose an attribute has n different values: a1, …, an
• Assume a small enough value ε > 0 • Let Pi be the frequency of ai,
Pi = # tuples having ai / total # of tuples • Estimate
P(ai) = ε + (1 − nε) Pi  (the n smoothed estimates still sum to 1)
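A sketch of the smoothing formula on hypothetical humidity frequencies (note that "Low" was never observed, so its raw frequency is 0):

```python
eps = 0.01
freq = {"High": 0.7, "Normal": 0.3, "Low": 0.0}   # hypothetical raw frequencies P_i
n = len(freq)

# P(a_i) = eps + (1 - n*eps) * P_i : every value gets probability at least eps
smoothed = {a: eps + (1 - n * eps) * p for a, p in freq.items()}
```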
Characteristics of Naïve Bayes
• Robust to isolated noise points – Such points are averaged out in probability
computation • Insensitive to missing values • Robust to irrelevant attributes
– Distributions on such attributes are almost uniform
• Correlated attributes degrade the performance
Bayes Error Rate
• The error rate of the ideal Bayes classifier: the irreducible error due to overlapping classes
Err = ∫_0^x P(Crocodile | X) dX + ∫_x^∞ P(Alligator | X) dX

where x is the decision threshold
Pros and Cons
• Pros – Easy to implement – Good results obtained in many cases
• Cons – A (too) strong assumption: independent
attributes • How to handle dependent/correlated
attributes? – Bayesian belief networks
Associative Classification
• Mine possible association rules (PRs) of the form condset → c – condset: a set of attribute-value pairs – c: a class label
• Build classifier – Organize rules according to decreasing
precedence based on confidence and support • Classification
– Use the first matching rule to classify an unknown case
Associative Classification Methods
• CBA (Classification By Association: Liu, Hsu & Ma, KDD’98) – Mine possible association rules of the form
• cond-set (a set of attribute-value pairs) → class label
– Build classifier: Organize rules according to decreasing precedence based on confidence and then support
• CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01) – Classification: Statistical analysis on multiple rules
Instance-based Methods
• Instance-based learning – Store training examples and delay the processing until a
new instance must be classified (“lazy evaluation”) • Typical approaches
– K-nearest neighbor approach • Instances represented as points in a Euclidean space
– Locally weighted regression • Construct local approximation
– Case-based reasoning • Use symbolic representations and knowledge-based inference
The K-Nearest Neighbor Method
• Instances are points in an n-D space
• The k-nearest neighbors (KNN) under the Euclidean distance – Return the most common value among the k training examples nearest to the query point
• Works for discrete- and real-valued target functions

[Figure: a query point xq surrounded by + and − training examples]
KNN Methods
• For continuous-valued target functions, return the mean value of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm – Give greater weights to closer neighbors
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality – Distance could be dominated by irrelevant attributes – Axes stretch or elimination of the least relevant attributes
w ≡ 1 / d(xq, xi)²
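A minimal distance-weighted kNN sketch in pure Python (the training points and queries are toy data):

```python
def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(train, x, k=3):
    """Vote among the k nearest neighbors, each weighted by 1 / d(x_q, x_i)^2."""
    nearest = sorted(train, key=lambda item: dist2(item[0], x))[:k]
    votes = {}
    for pt, label in nearest:
        d2 = dist2(pt, x)
        w = 1 / d2 if d2 > 0 else float("inf")   # an exact match dominates the vote
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

train = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"),
         ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+")]
```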
Lazy vs. Eager Learning
• Efficiency: lazy learning uses less training time but more predicting time
• Accuracy – Lazy method effectively uses a richer hypothesis
space – Eager: must commit to a single hypothesis that
covers the entire instance space
Outlier Detection
Motivation: Fraud Detection
Techniques: Fraud Detection
• Features • Dissimilarity • Groups and noise
Outlier Analysis
• “One person’s noise is another person’s signal”
• Outliers: objects considerably dissimilar from the remainder of the data
  – Examples: credit card fraud, Michael Jordan, intrusions, etc.
  – Applications: credit card fraud detection, telecom fraud detection, intrusion detection, customer segmentation, medical analysis, etc.
Outliers and Noise
• Outliers are different from noise
  – Noise is random error or variance in a measured variable
• Outliers are interesting: an outlier violates the mechanism that generates the normal data
• Outlier detection vs. novelty detection
  – At an early stage, novel objects may be regarded as outliers
  – Later, they are merged into the model of normal data
Types of Outliers
• Three kinds: global, contextual, and collective outliers
  – A data set may have multiple types of outliers
  – One object may belong to more than one type of outlier
• Global outlier (or point anomaly)
  – An object that significantly deviates from the rest of the data set
  – Challenge: find an appropriate measurement of deviation
Contextual Outliers
• An outlier object deviates significantly within a selected context
  – Example: is 10°C in Vancouver an outlier? (It depends: summer or winter?)
• Attributes of data objects are divided into two groups
  – Contextual attributes: define the context, e.g., time and location
  – Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?
Collective Outliers
• A subset of data objects that collectively deviates significantly from the whole data set, even if the individual objects are not outliers
  – Application example: intrusion detection, when a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
  – Consider not only the behavior of individual objects, but also that of groups of objects
  – Requires background knowledge about the relationships among data objects, such as a distance or similarity measure
Outlier Detection: Challenges
• Modeling normal objects and outliers properly
  – Hard to enumerate all possible normal behaviors in an application
  – The border between normal objects and outliers is often a gray area
• Application-specific outlier detection
  – The choice of distance measure and the model of relationships among objects are often application-dependent
  – Example: in clinical data a small deviation could be an outlier, while marketing analysis tolerates much larger fluctuations
Outlier Detection: Challenges
• Handling noise in outlier detection
  – Noise may distort the normal objects and blur the distinction between normal objects and outliers
  – Noise may hide outliers and reduce the effectiveness of outlier detection
• Understandability
  – Understand why these objects are outliers: justification of the detection
  – Specify the degree of an outlier: the unlikelihood that the object was generated by a normal mechanism
Outlier Detection Methods
• By whether user-labeled examples of outliers can be obtained
  – Supervised, semi-supervised, and unsupervised methods
• By assumptions about normal data and outliers
  – Statistical, proximity-based, and clustering-based methods
Supervised Methods
• Model outlier detection as a classification problem
  – Samples examined by domain experts are used for training and testing
• Methods for learning a classifier for outlier detection
  – Model normal objects and report those not matching the model as outliers, or
  – Model outliers and treat those not matching the model as normal
• Challenges
  – Imbalanced classes, i.e., outliers are rare: boost the outlier class and make up some artificial outliers
  – Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)
Unsupervised Methods
• Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
• An outlier is expected to be far away from any group of normal objects
• Weakness: cannot detect collective outliers effectively
  – Normal objects may not share any strong patterns, while collective outliers may share high similarity within a small area
• Many clustering methods can be adapted for unsupervised outlier detection
  – Find clusters first; outliers are the objects not belonging to any cluster
Unsupervised Methods: Challenges
• In some intrusion or virus detection settings, normal activities are diverse
  – Unsupervised methods may have a high false-positive rate and still miss many real outliers
  – Supervised methods can be more effective, e.g., at identifying attacks on key resources
• Challenges
  – Hard to distinguish noise from outliers
  – Costly, since clustering comes first, yet there are far fewer outliers than normal objects
• Newer methods tackle outliers directly
Semi-Supervised Methods
• In many applications, the number of labeled objects is small
  – Labels could be on outliers only, normal objects only, or both
• If some labeled normal objects are available
  – Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
  – Objects not fitting the model of normal objects are detected as outliers
• If only some labeled outliers are available, the small number of labeled outliers may not cover the possible outliers well
  – To improve the quality of outlier detection, one can get help from models for normal objects learned by unsupervised methods
Pros and Cons
• The effectiveness of statistical methods highly depends on whether the assumed statistical model holds for the real data
• There are rich alternatives among statistical models
  – Parametric vs. non-parametric
Proximity-based Methods
• An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set
Pros and Cons
• The effectiveness of proximity-based methods relies heavily on the proximity measure
• In some applications, proximity or distance measures cannot be obtained easily
• Often have difficulty identifying a group of outliers that stay close to each other
• Two major types of proximity-based outlier detection methods
  – Distance-based vs. density-based
Clustering-based Methods
• Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster
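As a sketch of this idea (not from the slides), scikit-learn's DBSCAN labels points that fall in no cluster as noise (label -1), which can serve directly as outlier candidates; the data and parameters below are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)

# Two dense clusters plus one isolated point.
X = np.vstack([
    rng.normal(0, 0.2, size=(40, 2)),
    rng.normal(5, 0.2, size=(40, 2)),
    [[2.5, 8.0]],
])

# DBSCAN assigns the label -1 to points that belong to no cluster.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(np.nonzero(labels == -1)[0])  # index of the isolated point
```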
Challenges
• Since there are many clustering methods, there are many clustering-based outlier detection methods as well
• Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets
Statistical Outlier Analysis
• Assumption: the objects in a data set are generated by a (stochastic) process (a generative model)
• Learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers
• Two categories: parametric vs. non-parametric
Example
• Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model
  – Data not following the model are outliers
Parametric Methods
• Assumption: the normal data are generated by a parametric distribution with parameter θ
• The probability density function of the parametric distribution, f(x | θ), gives the probability density of object x under the distribution
• The smaller this value, the more likely x is an outlier
Univariate Outliers Based on Normal Distribution
• Taking derivatives with respect to µ and σ2, we derive the following maximum likelihood estimates
$$\ln L(\mu,\sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

$$\hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
Example
• Daily average temperatures: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
• With n = 10, the maximum likelihood estimates are µ̂ = 28.61 and σ̂ = √2.29 = 1.51
• Since (24 − 28.61)/1.51 = −3.04 < −3, the value 24 is an outlier, because the region µ ± 3σ contains 99.7% of the data under a normal distribution
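The example can be checked with a few lines of NumPy; this is a minimal sketch using the example's data, with variable names of my choosing:

```python
import numpy as np

# Daily average temperatures from the example above.
temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Maximum likelihood estimates under a normal model:
mu = temps.mean()       # sample mean, 28.61
sigma = temps.std()     # biased (divide-by-n) standard deviation, the MLE

# z-scores; readings outside mu +/- 3*sigma are flagged as outliers.
z = (temps - mu) / sigma
print(mu, sigma)
print(temps[np.argmax(np.abs(z))])  # the most extreme reading: 24.0
```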
The Grubbs Test
• Also known as the maximum normed residual test
• For each object x in a data set, compute its z-score; x is an outlier if

$$z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$$

  where $t_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution with N − 2 degrees of freedom at a significance level of α/(2N), and N is the number of objects in the data set
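As a sketch (not from the slides), the critical value on the right-hand side can be computed with SciPy's t-distribution; for N = 10 and α = 0.05 it is about 2.29, the standard table value:

```python
import math
from scipy.stats import t

def grubbs_critical(n, alpha=0.05):
    """Critical z-score for the two-sided Grubbs test with n observations."""
    # Upper critical value of the t-distribution at significance alpha/(2n),
    # with n - 2 degrees of freedom.
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    return ((n - 1) / math.sqrt(n)) * math.sqrt(
        t_crit**2 / (n - 2 + t_crit**2)
    )

print(grubbs_critical(10))  # ~2.29 for N = 10, alpha = 0.05
```

Any observation whose z-score exceeds this value is rejected as an outlier at the chosen significance level.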
Non-parametric Method
• Do not assume an a priori statistical model; instead, determine the model from the input data
  – Not completely parameter-free, but the number and nature of the parameters are flexible rather than fixed in advance
• Examples: histogram and kernel density estimation
Histogram
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
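A minimal sketch of histogram-based scoring on synthetic transaction amounts (the data, bin count, and threshold are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mostly small transaction amounts, plus one very large transaction.
amounts = np.concatenate([rng.uniform(10, 500, size=999), [7500.0]])

# Build a histogram and score each transaction by the relative frequency
# of its bin; points falling in rare or empty bins get a low score.
counts, edges = np.histogram(amounts, bins=20)
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
freq = counts[bin_idx] / len(amounts)

outliers = amounts[freq < 0.005]  # bins holding less than 0.5% of the data
print(outliers)
```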
Challenges
• Hard to choose an appropriate bin size for the histogram
  – Too small a bin size → normal objects fall in empty or rare bins: false positives
  – Too large a bin size → outliers fall in frequent bins: false negatives
Proximity-based Outlier Detection
• Objects far away from the others are outliers
• The proximity of an outlier deviates significantly from that of most of the others in the data set
• Distance-based outlier detection: an object o is an outlier if its neighborhood does not contain enough other points
• Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbors
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2) 395
Depth-based Methods
• Organize data objects in layers of various depths
  – The shallow layers are more likely to contain outliers
• Examples: peeling, depth contours
• Complexity O(N^⌈k/2⌉) for k-dimensional data sets
  – Unacceptable for k > 2
Depth-based Outliers: Example
Distance-based Outliers
• A DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from O
• The larger D and the larger p, the more outlying the object
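The definition above can be sketched directly in NumPy; the quadratic pairwise-distance approach below is fine for small data sets, and the data are synthetic:

```python
import numpy as np

def db_outliers(X, p, D):
    """Boolean mask of DB(p, D)-outliers over the rows of X.

    An object is a DB(p, D)-outlier if at least a fraction p of all
    objects lie at a distance greater than D from it.
    """
    # Pairwise Euclidean distances between all rows.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Fraction of objects farther than D from each object.
    frac_far = (dist > D).mean(axis=1)
    return frac_far >= p

# A tight cluster around the origin plus one faraway point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[10.0, 10.0]]])
print(np.nonzero(db_outliers(X, p=0.9, D=5.0))[0])  # only the last point
```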
Density-based Local Outlier
Both o1 and o2 in the figure are outliers; distance-based methods can detect o1, but not o2
Intuition
• Compare objects to their local neighborhoods instead of the global data distribution
• The density around an outlier object is significantly different from the density around its neighbors
• Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier
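A simplified sketch of this relative-density idea (this is not the full LOF definition, which uses reachability distances; data and parameters are illustrative):

```python
import numpy as np

def relative_density_scores(X, k=3):
    """Simplified LOF-style score: average neighbor density / own density."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # exclude self-distances
    knn = np.argsort(dist, axis=1)[:, :k]     # indices of k nearest neighbors
    knn_dist = np.take_along_axis(dist, knn, axis=1)
    density = 1.0 / knn_dist.mean(axis=1)     # local density estimate
    # Score > 1 means the neighbors are denser than the point itself.
    return density[knn].mean(axis=1) / density

# A dense cluster plus a point that is an outlier only relative to it.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)), [[1.0, 0.0]]])
scores = relative_density_scores(X, k=3)
print(np.argmax(scores))  # the appended point has the highest score
```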
Classification-based Outlier Detection
• Train a classification model that can distinguish “normal” data from outliers
• A brute-force approach: consider a training set that contains some samples labeled “normal” and others labeled “outlier”
  – A training set in practice is typically heavily biased: the number of “normal” samples likely far exceeds the number of outlier samples
  – Cannot detect unseen anomalies
One-Class Model
• A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using a classification method such as a one-class SVM
• Any sample that does not belong to the normal class (i.e., falls outside the decision boundary) is declared an outlier
• Advantage: can detect new outliers that do not appear close to any outlier objects in the training set
• Extension: normal objects may belong to multiple classes
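A minimal one-class sketch using scikit-learn's OneClassSVM (the training data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Training data: only "normal" samples, clustered around the origin.
X_train = rng.normal(0, 1, size=(200, 2))

# nu bounds the fraction of training points treated as boundary violations.
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)

# predict() returns +1 for points inside the learned boundary, -1 outside.
X_new = np.array([[0.0, 0.5],   # near the normal cluster
                  [6.0, 6.0]])  # far away, flagged as an outlier
print(clf.predict(X_new))
```

Note that no outlier examples were needed at training time, which is the advantage claimed above.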
Semi-Supervised Learning Methods
• Combine classification-based and clustering-based methods
• Method
  – Use a clustering-based approach to find a large cluster C and a small cluster C1
  – Since some objects in C carry the label “normal”, treat all objects in C as normal
  – Use the one-class model of this cluster to identify normal objects in outlier detection
  – Since some objects in cluster C1 carry the label “outlier”, declare all objects in C1 as outliers
  – Any object that does not fall into the model for C (such as a) is considered an outlier as well
Pros and Cons
• Pros: outlier detection is fast
• Cons: quality heavily depends on the availability and quality of the training set
  – It is often difficult to obtain representative, high-quality training data
Detection of Contextual Outliers
• If the contexts can be clearly identified, transform the problem into conventional outlier detection
  – Identify the context of the object using the contextual attributes
  – Calculate the outlier score for the object within its context using a conventional outlier detection method
Example
• Detect outlier customers in the context of customer groups
  – Contextual attributes: age group, postal code
  – Behavioral attributes: the number of transactions per year, annual total transaction amount
• Method: for a customer c
  – Locate c’s context
  – Compare c with the other customers in the same group
  – Use a conventional outlier detection method
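A minimal sketch of this per-context scoring (the customer records, group labels, and values below are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical customer records: age group and transactions per year.
groups = np.array(["20s", "20s", "20s", "20s", "60s", "60s", "60s", "60s"])
tx = np.array([200.0, 210.0, 190.0, 205.0, 20.0, 25.0, 22.0, 200.0])

# Score each customer against the others in the SAME context (age group),
# here with a simple z-score as the conventional detection method.
scores = np.empty_like(tx)
for g in np.unique(groups):
    mask = groups == g
    mu, sigma = tx[mask].mean(), tx[mask].std()
    scores[mask] = np.abs(tx[mask] - mu) / sigma

# 200 transactions/year is normal in the 20s group but extreme in the 60s.
print(np.argmax(scores))
```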
Modeling Normal Behavior
• Model the “normal” behavior with respect to contexts
  – Use a training data set to train a model that predicts the expected behavioral attribute values with respect to the contextual attribute values
  – An object is a contextual outlier if its behavioral attribute values significantly deviate from the values predicted by the model
• Use a prediction model to link the contexts and the behavior
  – Avoids explicit identification of specific contexts
  – Possible methods: regression, Markov models, finite state automata, …
Collective Outliers
• Objects that, as a group, deviate significantly from the entire data set
• Examine the structure of the data set, i.e., the relationships between multiple data objects
  – The structures are often not explicitly defined and have to be discovered as part of the outlier detection process
Detecting High Dimensional Outliers
• Interpretability of outliers
  – Which subspaces manifest the outliers, plus an assessment of the “outlying-ness” of the objects
• Data sparsity: data in high-dimensional spaces are often sparse
  – The distance between objects becomes heavily dominated by noise as the dimensionality increases
• Data subspaces
  – Capture the local behavior and patterns of the data
• Scalability with respect to dimensionality
  – The number of subspaces increases exponentially with dimensionality
Angle-based Outliers