Introduction
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Mining -- Introduction 2
• Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
• Product information (product-id, category, manufacturer, made-in, stock-price, …)
• Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
• Business queries
Techniques: Business Intelligence
• Multidimensional data analysis
• Online query answering
• Interactive data exploration
Motivation: Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Techniques: Store Layout Design
• Customer purchase patterns
• Business strategies
Motivation: Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Techniques: Community Detection
• Similarity between objects
• Partitioning objects into groups
  – No guidance about what a group is
Motivation: Disease Prediction
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …
What medical problems does this patient have?
Techniques: Disease Prediction
• Features
• Model
Motivation: Fraud Detection
http://i.imgur.com/ckkoAOp.gif
Techniques: Fraud Detection
• Features
• Dissimilarity
• Groups and noise
http://i.stack.imgur.com/tRDGU.png
What Is Data Science About?
• Data
• Extraction of knowledge from data
• Continuation of data mining and knowledge discovery from data (KDD)
What Is Data?
• Values of qualitative or quantitative variables belonging to a set of items
• Represented in a structure, e.g., a tabular, tree, or graph structure
• Typically the results of measurements
• As an abstract concept, can be viewed as the lowest level of abstraction from which information and then knowledge are derived
What Is Information?
• “Knowledge communicated or received concerning a particular fact or circumstance”
• Conceptually, information is the message (utterance or expression) being conveyed
• Cannot be predicted
• Can resolve uncertainty
What Is Knowledge?
• Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
• Implicit knowledge: practical skill or expertise
• Explicit knowledge: theoretical understanding of a subject
Data Systems
• A data system answers queries based on data acquired in the past
• Base data – the rawest data, not derived from anywhere else
• Knowledge – information derived from the base data
Dealing with Data – Querying
• Given a set of student records about name, age, courses taken, and grades
• Simple queries
  – What is John Doe’s age?
• Aggregate queries
  – What is the average GPA of all students at this school?
• Queries can be arbitrarily complicated
  – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
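The simple and aggregate query types above can be sketched with Python’s built-in sqlite3 module; the student table and its values here are invented purely for illustration, not real course data:

```python
import sqlite3

# Hypothetical student records, invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [("John Doe", 20, 3.2), ("Jane Roe", 22, 3.8)])

# Simple query: one fact about one individual
age = conn.execute(
    "SELECT age FROM students WHERE name = ?", ("John Doe",)).fetchone()[0]

# Aggregate query: one fact about the whole group
avg_gpa = conn.execute("SELECT AVG(gpa) FROM students").fetchone()[0]

print(age, avg_gpa)  # 20 3.5
```

The “grades less than 3% apart” query would additionally need a self-join over per-course grades, which is one way queries become arbitrarily complicated.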
Queries
• A precise request for information
• Subjects in databases and information retrieval
  – Databases: structured queries on structured (e.g., relational) data
  – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
• Important assumptions
  – Information needs
  – Query languages
Data-driven Exploration
• What should be the next strategy of a company?
  – A lot of data: sales, human resources, production, tax, service cost, …
• The question cannot be translated into a precise request for information (i.e., a query)
• Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data
Data-driven Thinking
• Starting with some simple queries
• New queries are raised by consuming the results of previous queries
• No ultimate query in design!
  – But many queries can be answered using DB/IR techniques
The Art of Data-driven Thinking
• The way of generating queries remains an art!
  – Different people may derive different results using the same data

“If you torture the data long enough, it will confess.” – Ronald H. Coase

• More often than not, more data may be needed – datafication
Queries for Data-driven Thinking
• Probe queries – finding information about specific individuals
• Aggregation – finding information about groups
• Pattern finding – finding commonality in a population
• Association and correlation – finding connections among individuals and groups
• Causality analysis – finding causes and consequences
What Is Data Mining?
• Broader sense: the art of data-driven thinking
• Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
  – Methods and tools for answering various types of queries in the data mining process in the broader sense
Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell

• Essentially, learn the distribution of the data
Data mining vs. Machine Learning
• Machine learning focuses on prediction, based on known properties learned from the training data
• Data mining focuses on the discovery of (previously) unknown properties in the data
The KDD Process
Data → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Transformation] → Transformed data → [Data mining] → Patterns → [Interpretation/evaluation] → Knowledge
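The five stages can be sketched as a chain of functions feeding each other; the tiny stage bodies below are placeholders to show the shape of the process, not the actual methods covered later in the course:

```python
# A minimal sketch of the KDD process as function composition.
# Each stage's implementation is a stand-in, chosen only to be runnable.

def selection(data):            # pick the target data relevant to the task
    return [r for r in data if r is not None]

def preprocessing(target):      # clean: drop duplicates, impose an order
    return sorted(set(target))

def transformation(clean):      # re-encode into a mining-ready form
    return [x * x for x in clean]

def data_mining(transformed):   # extract "patterns" (here: values above a threshold)
    return [x for x in transformed if x > 10]

def interpretation(patterns):   # evaluate patterns into knowledge
    return {"interesting": patterns}

data = [3, 1, None, 4, 1, 5]
knowledge = interpretation(data_mining(transformation(preprocessing(selection(data)))))
print(knowledge)  # {'interesting': [16, 25]}
```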
Data Mining R&D
• New problem identification
• Data collection and transformation
• Algorithm design and implementation
• Evaluation
  – Effectiveness evaluation
  – Efficiency & scalability evaluation
• Deployment and business solution
Data Mining on Big Data
“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it”
– Hal Varian, Google’s Chief Economist
What Is Big Data?
• No quantitative definition!
• “Big data is like teenage sex:
  – everyone talks about it,
  – nobody really knows how to do it,
  – everyone thinks everyone else is doing it,
  – so everyone claims they are doing it...” – Dan Ariely
Data Volume vs. Storage Cost
• The unit cost of disk storage decreases dramatically
| Year | Unit cost |
| --- | --- |
| 1956 | $10,000/MB |
| 1980 | $193/MB |
| 1990 | $9/MB |
| 2000 | $6.9/GB |
| 2010 | $0.08/GB |
| 2013 | $0.06/GB |
http://ns1758.ca/winch/winchest.html
Big Data – Volume
“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time”
— Wikipedia
Big Data: Volume
• Every day, about 7 billion shares change hands on US equity markets
  – About 2/3 is traded by computer algorithms based on huge amounts of data to predict gains and risks
• In Q2 2015
  – Facebook had 1.49 billion active users
  – WeChat had 600 million active users, 100 million outside China
  – LinkedIn had 380 million active users
  – Twitter had 304 million active users
Velocity
• Google processes 24+ petabytes of data per day
• Facebook gets 10+ million new photos uploaded every hour
• Facebook members like or leave a comment 3+ billion times per day
• YouTube users upload 1+ hour of video every second
• 400+ million tweets per day
What Has Been Changed?
• The 1880 US census took 8 years to complete
  – The 1890 census would have needed 13 years; using punch cards, it was reduced to less than 1 year
• It is essential to get not only accurate but also timely data
  – Statisticians use sampling to estimate
• Recently, new technologies have fundamentally changed the ways data is collected and transmitted
Sampling for Volume/Velocity?
• Sampling idea: the marginal new information brought by a larger amount of data shrinks quickly
  – The sample should be truly random
• On a data set of hundreds or thousands of attributes, can sampling help in
  – Finding subcategories of attribute combinations
  – Finding outliers and exceptions
• Big data contains signals of different strengths
  – Not noise, but weaker and weaker signals that may still be interesting and important
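The shrinking-marginal-information idea can be seen numerically: the spread of a sample mean falls roughly as 1/√n, so a 100× larger sample buys only about a 10× tighter estimate. A sketch on synthetic data (the population size and its mean/standard deviation are invented for the demonstration):

```python
import random

random.seed(0)
# Synthetic population: 100,000 values around mean 50, std 10 (invented parameters)
population = [random.gauss(50, 10) for _ in range(100_000)]

def sample_mean(n):
    """Mean of one truly random sample of size n."""
    return sum(random.sample(population, n)) / n

# Empirical spread of the estimator at two sample sizes
spreads = {}
for n in (100, 10_000):
    estimates = [sample_mean(n) for _ in range(50)]
    center = sum(estimates) / len(estimates)
    spreads[n] = (sum((e - center) ** 2 for e in estimates) / len(estimates)) ** 0.5

print(spreads)  # spread at n=10,000 is roughly 10x smaller than at n=100
```

Note that this speaks only to estimating a global statistic; it says nothing about finding rare subcategories or outliers, which is exactly where sampling struggles.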
Big Data – Lytro Pictures
• Lytro pictures record the whole light field
  – Photographers can decide later which parts to focus on
• Big data tries to record as much information as possible
  – Analysts can decide later what to extract from big data
  – Both advantages and challenges
Veracity
• “1 in 3 business leaders don't trust the information they use to make decisions”
• Assuming a slowly growing total cost budget, tradeoff between data volume and data quality
• Loss of veracity in combining different types of information from different sources
• Loss of veracity in data extraction, transformation, and processing
Variety
• Integrating data capturing different aspects of a data object
  – Vancouver Canucks: game video, technical statistics, social media, …
  – Different pieces are in different formats
• Different views of the same data object from different sources
  – Did the soccer ball pass the goal line?
  – The views may not be consistent
Four V-challenges
• Volume: massive scale and growth, 40% per year in global data generated
• Velocity: real-time data generation and consumption
• Variety: heterogeneous data, mainly unstructured or semi-structured, from many sources
• Veracity
Is Big Data Really New?
• People were aware of the existence of big data long ago, but no one could access it until very recently
  – (Genesis 28:15) “I am with you and will watch over you wherever you go”
  – “Whispers in a secret room, Heaven hears as thunder; a deceitful heart in a dark chamber, the gods see as lightning; retribution for good and evil follows like a shadow” (Chinese maxim)
  – Similar statements in the Quran and sutras
• What has changed?
  – How data is connected with people
Diversity in Data Usage
• In the past, only very few projects could afford to be data-intensive
• Nowadays, a great many applications are (naturally) data-intensive
Datafication
• Extracting data about an object or event in a quantified way so that it can be analyzed
  – Different from digitalization
• An important feature of big data
• Key: new data, new applications, new opportunities
New Values of Datafication
• Example: CAPTCHA and reCAPTCHA (Luis von Ahn)
• How to create new value from data and datafication?
  – Connecting data with new users
  – Connecting different pieces of data to present a bigger picture
• Important techniques
  – Data aggregation
  – Extended datafication
Big Data Players
• Data holders
• Data specialists
• Big-data mindset leaders
• A capable company may play two or three roles at the same time
• What is most important: the big-data mindset, skills, or the data itself?
Privacy
• “… big data analytics have the potential to eclipse longstanding civil rights protections in how personal information is used in housing, credit, employment, health, education, and the marketplace”
— Executive Office of the (US) President
Keep in Mind
“Our industry does not respect tradition – it only respects innovation.”
– Satya Nadella
Goals of This Course
• Data-driven thinking – towards being a (big) data scientist
• Principles and hands-on skills of data mining, particularly in the context of big data
  – Identifying new data mining problems
  – Data mining algorithm design
  – Data mining applications
• Novel problems for upcoming research
Format
• Due to the fast progress in data mining, we will go substantially beyond the textbook
• Active classroom discussion
• Open questions and brainstorming
• Textbook: Data Mining – Concepts and Techniques (3rd ed.)
Read – Try – Think
• Reading
  – (Required) Textbook and a small number of research papers
  – You must have the 3rd ed. of the textbook!
  – (Open-ended; not covered by the exam) Technical and non-technical materials
• Trying
  – Assignments and a project
• Thinking
  – Examine everything from a data scientist’s angle, starting today
Data Mining: History
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991–94 Workshops on Knowledge Discovery in Databases
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
Data Mining: History (cont’d)
• 1995–98 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95–98)
  – Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
  – PAKDD (1997), PKDD (1997), SIAM Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007
Frequent Pattern Mining
How Many Words Is a Picture Worth?
Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 53
E. Aiden and J.-B. Michel: Uncharted. Riverhead Books, 2013
Burnt or Burned?
Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Transaction Data
• Alphabet: a set of items
  – Example: all products sold in a store
• A transaction: a set of items involved in an activity
  – Example: the items purchased by a customer in a visit
• Other information is often associated with a transaction
  – Timestamp, price, salesperson, customer-id, store-id, …
Examples of Transaction Data
How to Store Transaction Data?
• Example transactions: (t123, a, b, c), (t236, b, d)
• Relational storage: one (Tid, Item) row per item
    Tid   Item
    t123  a
    t123  b
    t123  c
    …     …
    t236  b
    t236  d
• Transaction-based storage: one record per transaction
• Item-based (vertical) storage: one Tid-list per item
  – Item a: …, t123, …
  – Item b: …, t123, …, t236, …
  – …
Transaction Data Analysis
• Transactions: customers' purchases of commodities
  – e.g., {bread, milk, cheese} if they are bought together in one visit
• Frequent patterns: product combinations that are frequently purchased together by customers
• More generally, frequent patterns are patterns (sets of items, sequences, etc.) that occur frequently in a database [AIS93]
Why Frequent Patterns?
• What products were often purchased together?
• What are the frequent subsequent purchases after buying an iPod?
• What kinds of genes are sensitive to this new drug?
• What key-word combinations are frequently associated with web pages about game-evaluation?
Why Frequent Pattern Mining?
• Foundation for many data mining tasks
  – Association rules, correlation, causality, sequential patterns, spatial and multimedia patterns, associative classification, cluster analysis, iceberg cube, …
• Broad applications
  – Basket data analysis, cross-marketing, catalog design, sales campaign analysis, web log (click-stream) analysis, …
Frequent Itemsets
• Itemset: a set of items
  – E.g., acm = {a, c, m}
• Support of an itemset: the number of transactions containing it
  – Sup(acm) = 3
• Given min_sup = 3, acm is a frequent pattern
• Frequent pattern mining: finding all frequent patterns in a database
Transaction database TDB:
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
A Naïve Attempt
• Generate all possible itemsets, test their supports against the database
• How to hold a large number of itemsets in main memory?
  – 100 items → 2^100 − 1 possible itemsets
• How to test the supports of a huge number of itemsets against a large database, say one containing 100 million transactions?
  – A transaction of length 20 requires updating the supports of 2^20 − 1 = 1,048,575 itemsets
Transactions in Real Applications
• A large department store often carries more than 100 thousand different kinds of items
  – Amazon.com carries more than 17,000 books relevant to data mining
• Walmart has more than 20 million transactions per day; AT&T produces more than 275 million calls per day
• Mining large transaction databases of many items is a real demand
How to Get an Efficient Method?
• Reducing the number of itemsets that need to be checked
• Checking the supports of selected itemsets efficiently
Candidate Generation & Test
• Any subset of a frequent itemset must also be frequent – an anti-monotonic property
  – A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  – {beer, diaper, nuts} is frequent ⇒ {beer, diaper} must also be frequent
• In other words, any superset of an infrequent itemset must also be infrequent
  – No superset of any infrequent itemset should be generated or tested
  – Many item combinations can be pruned!
Apriori-Based Mining
• Generate length (k+1) candidate itemsets from length k frequent itemsets, and
• Test the candidates against DB
The Apriori Algorithm [AgSr94]
Database D (Min_sup = 2):
TID  Items
10   a, c, d
20   b, c, e
30   a, b, c, e
40   b, e

Scan D → 1-candidates and counts: a:2, b:3, c:3, d:1, e:3
Frequent 1-itemsets: a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Scan D → counts: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D → counts: bce:2
Frequent 3-itemsets: bce:2
The Apriori Algorithm
Level-wise, candidate generation and test
• Ck: candidate itemsets of size k
• Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk ≠ ∅; k++) do
    Ck+1 = candidates generated from Lk;        // candidate generation
    for each transaction t in database do       // test
      increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
  return ∪k Lk;
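The pseudocode above can be rendered as a short Python sketch (a minimal illustration, not the optimized algorithm: the ordered self-join and the hash-tree counting are simplified to plain set operations, and the function name is only for this example):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise candidate generation and test. Candidates of size k+1
    are formed by unioning pairs of frequent k-itemsets, then pruned
    with the anti-monotone property before a counting scan."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # L1: frequent 1-itemsets
    L = {frozenset([i]): c for i, c in counts.items() if c >= min_sup}
    result = dict(L)
    k = 1
    while L:
        # candidate generation: join Lk with itself, prune by subsets
        C = set()
        for a in L:
            for b in L:
                u = a | b
                if len(u) == k + 1 and all(frozenset(s) in L
                                           for s in combinations(u, k)):
                    C.add(u)
        # test: count each candidate's support in one database scan
        L = {}
        for c in C:
            sup = sum(1 for t in transactions if c <= t)
            if sup >= min_sup:
                L[c] = sup
        result.update(L)
        k += 1
    return result
```

On the toy database D above (min_sup = 2) this yields exactly the nine frequent itemsets a, b, c, e, ac, bc, be, ce, and bce.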
Important Steps in Apriori
• How to find frequent 1- and 2-itemsets?
• How to generate candidates?
  – Step 1: self-joining Lk
  – Step 2: pruning
• How to count supports of candidates?
Finding Frequent 1- & 2-itemsets
• Finding frequent 1-itemsets (i.e., frequent items) using a one-dimensional array
  – Initialize c[item] = 0 for each item
  – For each transaction T, for each item in T, c[item]++
  – If c[item] >= min_sup, item is frequent
• Finding frequent 2-itemsets using a 2-dimensional triangle matrix
  – For items i, j (i < j), c[i, j] is the count of itemset ij
Counting Array
• A 2-dimensional triangle matrix can be implemented using a 1-dimensional array
• With n items, for items i, j (i < j): c[i, j] = c[(i−1)(2n−i)/2 + (j−i)]
• Example (n = 5): c[3, 5] = c[(3−1)(2·5−3)/2 + (5−3)] = c[9]

      j=2  j=3  j=4  j=5
i=1    1    2    3    4
i=2         5    6    7
i=3              8    9
i=4                  10
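The indexing rule can be checked in a few lines of Python (a sketch; `tri_index` and `count_pairs` are illustrative names, not from the slides):

```python
from itertools import combinations

def tri_index(i, j, n):
    """1-based flat position of the pair (i, j), i < j, among n items,
    following c[i, j] = c[(i-1)(2n-i)/2 + (j-i)]."""
    assert 1 <= i < j <= n
    return (i - 1) * (2 * n - i) // 2 + (j - i)

def count_pairs(transactions, n):
    """Count all 2-itemsets in one scan into a flat triangle array.
    Items are assumed to be numbered 1..n."""
    counts = [0] * (n * (n - 1) // 2)
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            counts[tri_index(i, j, n) - 1] += 1
    return counts
```

For example, tri_index(3, 5, 5) is 9, matching c[3, 5] = c[9] above.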
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 ∗ L3
  – abcd ← abc ∗ abd
  – acde ← acd ∗ ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
How to Generate Candidates?
• Suppose the items in Lk−1 are listed in an order
• Step 1: self-join Lk−1
    INSERT INTO Ck
    SELECT p.item1, p.item2, …, p.itemk−1, q.itemk−1
    FROM Lk−1 p, Lk−1 q
    WHERE p.item1 = q.item1, …, p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1
• Step 2: pruning
  – For each itemset c in Ck do
    • For each (k−1)-subset s of c do
        if (s is not in Lk−1) then delete c from Ck
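The same join-and-prune step in Python, with itemsets represented as sorted tuples (a sketch mirroring the SQL; `gen_candidates` is an illustrative name):

```python
from itertools import combinations

def gen_candidates(L_prev):
    """Self-join frequent (k-1)-itemsets that share their first k-2
    items, then prune candidates with an infrequent (k-1)-subset."""
    L_prev = set(L_prev)
    k = len(next(iter(L_prev))) + 1
    C = set()
    for p in L_prev:
        for q in L_prev:
            # join condition: equal prefixes, p's last item < q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must be frequent
                if all(s in L_prev for s in combinations(c, k - 1)):
                    C.add(c)
    return C
```

On the L3 from the previous example this joins to abcd and acde, prunes acde (ade is not in L3), and returns C4 = {abcd}.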
How to Count Supports?
• Why is counting supports of candidates a problem?
  – The total number of candidates can be huge
  – One transaction may contain many candidates
• Method
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and their counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports
[Figure: counting supports with a hash-tree. The subset function hashes items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches of each interior node; leaves hold candidate 3-itemsets such as 1 4 5, 1 2 4, 3 5 6, and 6 8 9. Transaction 1 2 3 5 6 is matched by recursively splitting it: 1 + 2 3 5 6, 1 2 + 3 5 6, 1 3 + 5 6, …]
Association Rules
• Rule: c → am
• Support: 3 (i.e., the support of acm)
• Confidence: 75% (i.e., sup(acm) / sup(c))
• Given a minimum support threshold and a minimum confidence threshold, find all association rules whose support and confidence pass the thresholds

Transaction database TDB:
TID  Items bought
100  f, a, c, d, g, i, m, p
200  a, b, c, f, l, m, o
300  b, f, h, j, o
400  b, c, k, s, p
500  a, f, c, e, l, p, m, n
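Given the frequent itemsets and their supports, rule generation is a post-processing step. A Python sketch (`gen_rules` is an illustrative name; by anti-monotonicity every subset of a frequent itemset is frequent, so all left-hand sides are assumed present in the support map):

```python
from itertools import combinations

def gen_rules(freq, min_conf):
    """Emit rules X -> Y with conf = sup(X ∪ Y) / sup(X) >= min_conf.
    freq maps frozenset -> support count for all frequent itemsets."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), r)):
                conf = sup / freq[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules
```

With the supports taken from TDB (sup(c) = 4, sup(acm) = 3), this produces the rule c → am with confidence 0.75, as in the example above.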
Challenges of Freq Pat Mining
• Multiple scans of the transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
Improving Apriori: Ideas
• Reducing the number of transaction database scans
• Shrinking the number of candidates
• Facilitating support counting of candidates
Bottleneck of Freq Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many scans and generates many candidates
  – To find frequent itemset i1i2…i100
    • # of scans: 100
    • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
  – Bottleneck: candidate generation and test
• Can we avoid candidate generation?
Search Space of Freq. Pat. Mining
• Itemsets form a lattice

Itemset lattice:
{}
A   B   C   D
AB  AC  AD  BC  BD  CD
ABC  ABD  ACD  BCD
ABCD
Set Enumeration Tree
• Use an order on items, and enumerate itemsets in lexicographic order
  – a, ab, abc, abcd, ac, acd, ad, b, bc, bcd, bd, c, cd, d
• Reduce the lattice to a tree

Set enumeration tree:
∅
a   b   c   d
ab  ac  ad  bc  bd  cd
abc  abd  acd  bcd
abcd
Borders of Frequent Itemsets
• Frequent itemsets are connected
  – ∅ is trivially frequent
  – X on the border ⇒ every subset of X is frequent
[Illustrated on the itemset lattice: ∅ / a b c d / ab ac ad bc bd cd / abc abd acd bcd / abcd]
Projected Databases
• To test whether Xy is frequent, we can use the X-projected database
  – The sub-database of the transactions containing X
  – Check whether item y is frequent in the X-projected database
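A minimal sketch of the projection step (illustrative name; transactions as sets):

```python
def projected_db(transactions, X):
    """The X-projected database: the transactions containing X, with X
    itself removed. Item y is frequent here exactly when X ∪ {y} is
    frequent in the original database."""
    X = set(X)
    return [set(t) - X for t in transactions if X <= set(t)]
```

For instance, projecting the 5-transaction ABCD database onto {A, B} keeps the three transactions containing both A and B; D occurs in two of them, so ABD is frequent at min_sup = 2.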
Compress Database by FP-tree
• The 1st scan: find frequent items
  – Only record frequent items in the FP-tree
  – F-list: f-c-a-b-m-p
• The 2nd scan: construct the tree
  – Order the frequent items in each transaction w.r.t. the f-list
  – Explore sharing among transactions

TID  Items bought              (ordered) frequent items
100  f, a, c, d, g, i, m, p    f, c, a, m, p
200  a, b, c, f, l, m, o       f, c, a, b, m
300  b, f, h, j, o             f, b
400  b, c, k, s, p             c, b, p
500  a, f, c, e, l, p, m, n    f, c, a, m, p

[Figure: the resulting FP-tree. From the root, a branch f:4 – c:3 – a:3 – m:2 – p:2, with side branches b:1 under f, b:1 – m:1 under a, and a second branch c:1 – b:1 – p:1; a header table links each item f, c, a, b, m, p to its node occurrences.]
Benefits of FP-tree
• Completeness
  – Never breaks a long pattern in any transaction
  – Preserves complete information for frequent-pattern mining
    • No need to scan the database again
• Compactness
  – Reduces irrelevant info: infrequent items are removed
  – Items in frequency-descending order (f-list): the more frequently an item occurs, the more likely it is to be shared
  – Never larger than the original database (not counting node-links and the count fields)
Partitioning Frequent Patterns
• Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p
  – Patterns containing p
  – Patterns having m but no p
  – …
  – Patterns having c but no a, b, m, or p
  – Pattern f
• Depth-first search of a set enumeration tree
  – The partitioning is complete and has no overlap
Find Patterns Having Item “p”
• Only transactions containing p are needed
• Form the p-projected database TDB|p
  – Start at entry p of the header table
  – Follow the side-link of frequent item p
  – Accumulate all transformed prefix paths of p
• p-projected database TDB|p: fcam:2, cb:1
• Local frequent item: c:3
• Frequent patterns containing p: p:3, pc:3
Find Patterns Having Item m But No p
• Form the m-projected database TDB|m
  – Item p is excluded (patterns containing both m and p are already covered by the p-partition)
  – TDB|m contains fca:2 and fcab:1
  – Local frequent items: f, c, a
• Build the FP-tree for TDB|m
[Figure: the m-projected FP-tree, a single branch f:3 – c:3 – a:3 with header table f, c, a]
Recursive Mining
• Patterns having m but no p can be mined recursively
• Optimization: enumerate patterns directly from a single-branch FP-tree
  – Enumerate all combinations of the items on the branch
  – Support = that of the last (deepest) item in the combination
    • m, fm, cm, am
    • fcm, fam, cam
    • fcam
[Figure: the m-projected FP-tree, a single branch f:3 – c:3 – a:3]
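The single-branch optimization can be sketched in Python (illustrative names). Each combination of branch items, appended to the current suffix, is a pattern whose support is the count of its deepest item:

```python
from itertools import combinations

def enum_single_branch(branch, suffix, suffix_sup):
    """branch: (item, count) pairs from root to leaf of a single-branch
    FP-tree; returns every pattern ending in `suffix` with its support."""
    patterns = {suffix: suffix_sup}
    for r in range(1, len(branch) + 1):
        for idx in combinations(range(len(branch)), r):
            items = tuple(branch[i][0] for i in idx) + suffix
            patterns[items] = branch[idx[-1]][1]  # deepest item's count
    return patterns
```

On the branch f:3 – c:3 – a:3 with suffix m this enumerates the eight patterns m, fm, cm, am, fcm, fam, cam, fcam, each with support 3.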
Enumerate Patterns from a Single Prefix of an FP-tree
• A (projected) FP-tree may have a single prefix path
  – Reduce the single prefix into one node
  – Mine the prefix part and the branching part separately, then join the results
[Figure: a tree r whose root starts with the single prefix path a1:n1 – a2:n2 – a3:n3 and then branches into b1:m1 and c1:k1 (with subtrees c2:k2, c3:k3); r is split into the prefix path and the multi-branch part r1]
FP-growth
• Pattern growth: recursively grow frequent patterns by partitioning both the patterns and the database
• Algorithm
  – For each frequent item, construct its projected database, and then its projected FP-tree
  – Repeat the process on each newly created projected FP-tree
  – Until the resulting FP-tree is empty or contains only one path (a single path generates all item combinations, each of which is a frequent pattern)
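The divide-and-conquer structure of FP-growth can be sketched without the tree itself, using plain projected databases (so this shows the pattern/database partitioning but not the FP-tree compression; the function name is illustrative):

```python
def pattern_growth(db, min_sup, suffix=()):
    """Recursively grow patterns: for each frequent item, output it with
    the current suffix, then mine its projected database. Restricting a
    projection to items smaller than the chosen item guarantees that
    every pattern is found exactly once."""
    counts = {}
    for t in db:
        for item in set(t):
            counts[item] = counts.get(item, 0) + 1
    result = {}
    for item in sorted(counts):
        sup = counts[item]
        if sup < min_sup:
            continue
        pattern = (item,) + suffix
        result[pattern] = sup
        # item-projected database: transactions containing `item`,
        # restricted to items that precede it in the chosen order
        proj = [[i for i in t if i < item] for t in db if item in t]
        result.update(pattern_growth(proj, min_sup, pattern))
    return result
```

On the four-transaction database used in the Apriori example (min_sup = 2) this finds the same nine frequent itemsets, with no candidate generation.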
Scaling up by DB Projection
• What if an FP-tree cannot fit into memory?
• Database projection
  – Partition the database into a set of projected databases
  – Construct and mine an FP-tree once a projected database fits into main memory
• Heuristic: projected databases shrink quickly in many applications
Parallel vs. Partition Projection
• Parallel projection: form all projected databases at once
• Partition projection: propagate projections level by level

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca, fca
  b-proj DB: f, cb, …
  a-proj DB: fc, …
  c-proj DB: f, …
  f-proj DB: …
    am-proj DB: fc, fc, fc
    cm-proj DB: f, f, f
    …
Why Is FP-growth Efficient?
• Divide-and-conquer strategy
  – Decomposes both the mining task and the DB
  – Leads to focused search of smaller databases
• Other factors
  – No candidate generation, no candidate test
  – Database compression using the FP-tree
  – No repeated scans of the entire database
  – Basic operations are counting local frequent items and building FP-trees; no pattern search or pattern matching
Major Costs in FP-growth
• Poor locality of FP-trees
  – Low cache hit rate
• Building FP-trees
  – A stack of FP-trees during the recursion
• Redundant information
  – Transaction abcd appears in the a-, ab-, abc-, ac-, …, c-projected databases and FP-trees
Effectiveness of Freq Pat Mining
• Too many patterns!
  – A pattern a1a2…an contains 2^n − 1 subpatterns
  – Understanding many patterns is difficult or even impossible for human users
• Non-focused mining
  – A manager may only be interested in patterns involving the items (s)he manages
  – A user is often interested in patterns satisfying some constraints
Itemset Lattice

Min_sup = 2

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

[Lattice: {} / A B C D / AB AC AD BC BD CD / ABC ABD ACD BCD / ABCD]
Max-Patterns

Min_sup = 2 (same transaction database as above)

Length  Frequent itemsets
1       A, B, C, D
2       AB, AC, AD, BC, BD, CD
3       ABC, ABD

[Lattice with the max-patterns highlighted]
Borders and Max-patterns
• Max-patterns: the border of the frequent patterns
  – Any subset of a max-pattern is frequent
  – Any proper superset of a max-pattern is infrequent
  – Max-patterns alone cannot generate rules: the support counts of their subsets are not retained
[Lattice: {} / A B C D / AB AC AD BC BD CD / ABC ABD ACD BCD / ABCD]
Patterns and Support Counts

Min_sup = 2

Tid  Transaction
10   ABD
20   ABC
30   AD
40   ABCD
50   CD

Length  Frequent itemsets
1       A:4, B:3, C:3, D:4
2       AB:3, AC:2, AD:3, BC:2, BD:2, CD:2
3       ABC:2, ABD:2
Frequent Closed Patterns
• For a frequent itemset X: if there exists no item y ∉ X such that every transaction containing X also contains y, then X is a frequent closed pattern
  – “acdf” is a frequent closed pattern here
• Concise representation of frequent patterns
  – Can generate non-redundant rules
• Reduces the number of patterns and rules
• N. Pasquier et al., ICDT’99

Min_sup = 2

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f
Closed and Max-patterns
• Closed-pattern mining algorithms can be adapted to mine max-patterns
  – A max-pattern must be closed
• Depth-first search methods have advantages over breadth-first search ones
  – Why?
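Both notions can be checked directly against the full set of frequent itemsets (a brute-force sketch for small examples, not a mining algorithm; the supports in the usage example are recomputed from the 5-transaction ABCD database used earlier):

```python
def closed_and_maximal(freq):
    """freq maps frozenset -> support. Closed: no proper superset has
    the same support (an equal-support superset would itself be
    frequent, so it suffices to look inside freq). Maximal: no proper
    frequent superset at all."""
    closed, maximal = {}, {}
    for X, sup in freq.items():
        supersets = [Y for Y in freq if X < Y]
        if all(freq[Y] < sup for Y in supersets):
            closed[X] = sup
        if not supersets:
            maximal[X] = sup
    return closed, maximal
```

For that database, B is frequent but not closed (every transaction containing B also contains A), and the max-patterns are ABC, ABD, and CD; every max-pattern is also closed.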
Constraint-based Data Mining
• Should we find all the patterns in a database autonomously?
  – The patterns could be too many and not focused!
• Data mining should be interactive
  – The user directs what is to be mined
• Constraint-based mining
  – User flexibility: the user provides constraints on what to mine
  – System optimization: push the constraints deep for efficient mining
Constraints in Data Mining
• Knowledge type constraint – classification, association, etc.
• Data constraint — using SQL-like queries – find product pairs sold together in stores in New York
• Dimension/level constraint – in relevance to region, price, brand, customer category
• Rule (or pattern) constraint – small sales (price < $10) triggers big sales (sum >$200)
• Interestingness constraint – strong rules: support and confidence
Constrained Mining vs. Search
• Constrained mining vs. constraint-based search
 – Both aim at reducing the search space
 – Finding all patterns vs. finding some (or one) answer satisfying the constraints
 – Constraint-pushing vs. heuristic search
 – Integrating both is an interesting research problem
• Constrained mining vs. DBMS query processing
 – Database query processing requires finding all answers
 – Constrained pattern mining shares a similar philosophy to pushing selections deep into query processing
Optimization
• Mining frequent patterns with constraint C
 – Sound: find only patterns satisfying constraint C
 – Complete: find all patterns satisfying constraint C
• A naïve solution
 – Test the constraint as a post-processing step
• More efficient approaches
 – Analyze the properties of the constraints
 – Push constraints as deeply as possible into frequent pattern mining
Anti-Monotonicity
• Anti-monotonicity
 – If an itemset S violates the constraint, so does every superset of S
 – sum(S.Price) ≤ v is anti-monotone
 – sum(S.Price) ≥ v is not anti-monotone
• Example
 – C: range(S.profit) ≤ 15
 – Itemset ab violates C (range = 40 > 15)
 – So does every superset of ab
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
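The anti-monotonicity claim can be sanity-checked on the slide's profit table. A small self-checking sketch (the item names and profits are from the slide; adding items can only widen a range, so a violation of range(S.profit) ≤ 15 can never be repaired):

```python
from itertools import combinations

# Item profits from the slide's table
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def rng(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

# C: range(S.profit) <= 15 is anti-monotone: once ab violates C
# (range = 40 - 0 = 40), every superset of ab violates C too.
assert rng({"a", "b"}) == 40                      # ab violates C
for k in range(1, 4):
    for extra in combinations(set(profit) - {"a", "b"}, k):
        assert rng({"a", "b"} | set(extra)) > 15  # so does every superset
print("checked all supersets of ab up to size 5")
```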
Anti-monotonic Constraints

Constraint                      Anti-monotone
v ∈ S                           no
S ⊇ V                           no
S ⊆ V                           yes
min(S) ≤ v                      no
min(S) ≥ v                      yes
max(S) ≤ v                      yes
max(S) ≥ v                      no
count(S) ≤ v                    yes
count(S) ≥ v                    no
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      yes
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      no
range(S) ≤ v                    yes
range(S) ≥ v                    no
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  yes
support(S) ≤ ξ                  no
Monotonicity
• Monotonicity
 – If an itemset S satisfies the constraint, so does every superset of S
 – sum(S.Price) ≥ v is monotone
 – min(S.Price) ≤ v is monotone
• Example
 – C: range(S.profit) ≥ 15
 – Itemset ab satisfies C
 – So does every superset of ab
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Monotonic Constraints

Constraint                      Monotone
v ∈ S                           yes
S ⊇ V                           yes
S ⊆ V                           no
min(S) ≤ v                      yes
min(S) ≥ v                      no
max(S) ≤ v                      no
max(S) ≥ v                      yes
count(S) ≤ v                    no
count(S) ≥ v                    yes
sum(S) ≤ v (∀a ∈ S, a ≥ 0)      no
sum(S) ≥ v (∀a ∈ S, a ≥ 0)      yes
range(S) ≤ v                    no
range(S) ≥ v                    yes
avg(S) θ v, θ ∈ {=, ≤, ≥}       convertible
support(S) ≥ ξ                  no
support(S) ≤ ξ                  yes
Converting “Tough” Constraints
• Convert tough constraints into anti-monotone or monotone by properly ordering items
• Examine C: avg(S.profit) ≥ 25
 – Order items in value-descending order
   • <a, f, g, d, b, h, c, e>
 – If an itemset afb violates C
   • So does afbh, and afb* (any itemset with afb as a prefix)
   • C becomes anti-monotone!
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

TDB (min_sup=2)

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Convertible Constraints
• Let R be an order of items
• Convertible anti-monotone
 – If an itemset S violates a constraint C, so does every itemset having S as a prefix w.r.t. R
 – Ex.: avg(S) ≥ v w.r.t. item-value-descending order
• Convertible monotone
 – If an itemset S satisfies constraint C, so does every itemset having S as a prefix w.r.t. R
 – Ex.: avg(S) ≥ v w.r.t. item-value-ascending order
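The convertibility argument can be checked numerically. This self-checking sketch uses the slide's profit table and C: avg(S.profit) ≥ 25 with the value-descending order R; since a prefix can only be extended by smaller-valued items, the average can never rise back above the threshold:

```python
# Sanity check of convertible anti-monotonicity on the slide's profit table.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}
R = sorted(profit, key=profit.get, reverse=True)   # <a, f, g, d, b, h, c, e>

def avg(items):
    return sum(profit[i] for i in items) / len(items)

# w.r.t. R, extending a prefix appends only smaller values, so the
# average can only drop: if a prefix violates avg(S) >= 25, every
# extension of that prefix violates it too.
prefix = ["a", "f", "b"]          # avg = 70/3 < 25: violates C
assert avg(prefix) < 25
for nxt in R[R.index("b") + 1:]:  # items after b in R: h, c, e
    assert avg(prefix + [nxt]) < 25
print("every extension of afb along R still violates C")
```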
Strongly Convertible Constraints
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R: <a, f, g, d, b, h, c, e> – Itemset af violates a constraint C, so does
every itemset with af as prefix, such as afd • avg(X) ≥ 25 is convertible monotone
w.r.t. item value ascending order R⁻¹: <e, c, h, b, d, g, f, a>
 – Itemset d satisfies constraint C, so do the itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible
Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Convertible Constraints

Constraint                                Convertible      Convertible  Strongly
                                          anti-monotone    monotone     convertible
avg(S) ≤ v, ≥ v                           Yes              Yes          Yes
median(S) ≤ v, ≥ v                        Yes              Yes          Yes
sum(S) ≤ v (items of any value, v ≥ 0)    Yes              No           No
sum(S) ≤ v (items of any value, v ≤ 0)    No               Yes          No
sum(S) ≥ v (items of any value, v ≥ 0)    No               Yes          No
sum(S) ≥ v (items of any value, v ≤ 0)    Yes              No           No
Can Apriori Handle Convertible Constraints?
• A convertible constraint that is neither monotone, nor anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
 – Within the level-wise framework, no direct pruning based on the constraint can be made
 – Itemset df violates constraint C: avg(X) ≥ 25
 – Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!
Item  Value
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
Mining With Convertible Constraints
• C: avg(S.profit) ≥ 25
• List items in every transaction in value-descending order R:
 – <a, f, g, d, b, h, c, e>
 – C is convertible anti-monotone w.r.t. R
• Scan the transaction DB once
 – Remove infrequent items
   • Item h in transaction 40 is dropped
 – Itemsets a and f are good
TID  Transaction
10   a, f, d, b, c
20   f, g, d, b, c
30   a, f, d, c, e
40   f, g, h, c, e

TDB (min_sup=2)

Item  Profit
a     40
f     30
g     20
d     10
b     0
h     -10
c     -20
e     -30
Not Every Pattern Is Interesting!
• Trivial patterns
 – Pregnant → Female [100% confidence]
• Misleading patterns
 – Play basketball → eat cereal [40%, 66.7%]
Jian Pei: CMPT 741/459 Frequent Pattern Mining (4) 118
            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
Evaluation Criteria
• Objective interestingness measures
 – Examples: support, patterns formed by mutually independent items
 – Domain independent
• Subjective measures
 – Examples: domain knowledge, templates/constraints
Correlation and Lift
• P(B|A)/P(B) is called the lift of rule A → B
• Play basketball → eat cereal (lift: 0.89)
• Play basketball → not eat cereal (lift: 1.33)

corr(A, B) = P(A ∪ B) / (P(A) P(B)) = P(AB) / (P(A) P(B))
            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
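The two lift values on this slide can be reproduced directly from the contingency table (a small sketch; the variable names are ad hoc):

```python
# Reproducing the two lift values from the basketball/cereal table
n = 5000
basketball, cereal, both = 3000, 3750, 2000

p_b = basketball / n                 # P(basketball)
p_c = cereal / n                     # P(cereal)

lift_cereal = (both / n) / (p_b * p_c)                      # A -> B
lift_not = ((basketball - both) / n) / (p_b * (1 - p_c))    # A -> not B

print(round(lift_cereal, 2), round(lift_not, 2))  # 0.89 1.33
```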
Contingency table

Table 6.7. A 2-way contingency table for variables A and B.

      B    ¬B
A     f11  f10  f1+
¬A    f01  f00  f0+
      f+1  f+0  N

… counts tabulated in a contingency table. Table 6.7 shows an example of a contingency table for a pair of binary variables, A and B. We use the notation ¬A (¬B) to indicate that A (B) is absent from a transaction. Each entry fij in this 2 × 2 table denotes a frequency count. For example, f11 is the number of times A and B appear together in the same transaction, while f01 is the number of transactions that contain B but not A. The row sum f1+ represents the support count for A, while the column sum f+1 represents the support count for B. Finally, even though our discussion focuses mainly on asymmetric binary variables, note that contingency tables are also applicable to other attribute types such as symmetric binary, nominal, and ordinal variables.

Limitations of the Support-Confidence Framework  The existing association rule mining formulation relies on the support and confidence measures to eliminate uninteresting patterns. The drawback of support was previously described in Section 6.8, in which many potentially interesting patterns involving low-support items might be eliminated by the support threshold. The drawback of confidence is more subtle and is best demonstrated with the following example.

Example 6.3. Suppose we are interested in analyzing the relationship between people who drink tea and coffee. We may gather information about the beverage preferences among a group of people and summarize their responses into a table such as the one shown in Table 6.8.

Table 6.8. Beverage preferences among a group of 1000 people.

      Coffee  ¬Coffee
Tea   150     50       200
¬Tea  650     150      800
      800     200      1000
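The interest factor used below reduces to N·f11 / (f1+·f+1), so the tea-coffee figure can be reproduced in one line (sketch; the helper name is ad hoc):

```python
# Interest factor from a 2x2 contingency table:
# I(A, B) = (f11/N) / ((f1+/N) * (f+1/N)) = N * f11 / (f1+ * f+1)
def interest(f11, f1_plus, f_plus1, n):
    return n * f11 / (f1_plus * f_plus1)

# Tea-coffee numbers from Table 6.8: f11=150, f1+=200, f+1=800, N=1000
print(interest(150, 200, 800, 1000))  # 0.9375
```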
Property of Lift
• If A and B are independent, lift = 1 • If A and B are positively correlated, lift > 1 • If A and B are negatively correlated, lift < 1 • Limitation: lift is sensitive to P(A) and P(B)
Table 6.9. Contingency tables for the word pairs {p, q} and {r, s}.

      p    ¬p
q     880  50   930
¬q    50   20   70
      930  70   1000

      r    ¬r
s     20   50   70
¬s    50   880  930
      70   930  1000

This equation follows from the standard approach of using simple fractions as estimates for probabilities. The fraction f11/N is an estimate for the joint probability P(A, B), while f1+/N and f+1/N are the estimates for P(A) and P(B), respectively. If A and B are statistically independent, then P(A, B) = P(A) × P(B), thus leading to the formula shown in Equation 6.6. Using Equations 6.5 and 6.6, we can interpret the measure as follows:

I(A, B) is = 1 if A and B are independent; > 1 if A and B are positively correlated; < 1 if A and B are negatively correlated.   (6.7)

For the tea-coffee example shown in Table 6.8, I = 0.15 / (0.2 × 0.8) = 0.9375, thus suggesting a slight negative correlation between tea drinkers and coffee drinkers.

Limitations of Interest Factor  We illustrate the limitation of interest factor with an example from the text mining domain. In the text domain, it is reasonable to assume that the association between a pair of words depends on the number of documents that contain both words. For example, because of their stronger association, we expect the words data and mining to appear together more frequently than the words compiler and mining in a collection of computer science articles.

Table 6.9 shows the frequency of occurrences between two pairs of words, {p, q} and {r, s}. Using the formula given in Equation 6.5, the interest factor for {p, q} is 1.02 and for {r, s} is 4.08. These results are somewhat troubling for the following reasons. Although p and q appear together in 88% of the documents, their interest factor is close to 1, which is the value when p and q are statistically independent. On the other hand, the interest factor for {r, s} is higher than for {p, q} even though r and s seldom appear together in the same document. Confidence is perhaps the better choice in this situation because it considers the association between p and q (94.6%) to be much stronger than that between r and s (28.6%).
lift(p, q) < lift(r, s)!
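The troubling word-pair numbers can be verified from Table 6.9 (sketch; the helper names are ad hoc):

```python
# Lift vs. confidence on the word-pair tables (Table 6.9, N = 1000)
def lift(f11, f1_plus, f_plus1, n):
    return (f11 / n) / ((f1_plus / n) * (f_plus1 / n))

def confidence(f11, f1_plus):
    return f11 / f1_plus

print(round(lift(880, 930, 930, 1000), 2))   # {p,q}: 1.02
print(round(lift(20, 70, 70, 1000), 2))      # {r,s}: 4.08
print(round(confidence(880, 930), 3))        # p -> q: 0.946
print(round(confidence(20, 70), 3))          # r -> s: 0.286
```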
From Itemsets to Sequences
• Itemsets: combinations of items, no temporal order • Temporal order is important in many situations
 – Time-series databases and sequence databases
 – Frequent patterns → (frequent) sequential patterns
• Applications of sequential pattern mining
 – Customer shopping sequences:
   • First buy a computer, then an iPod, and then a digital camera, within 3 months
– Medical treatment, natural disasters, science and engineering processes, stocks and markets, telephone calling patterns, Web log clickthrough streams, DNA sequences and gene structures
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences
A sequence database; a sequence: <(ef)(ab)(df)cb>

An element may contain a set of items. Items within an element are unordered and we list them alphabetically. <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
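The subsequence relation used here (each element of the candidate must be contained, in order, in some element of the sequence) can be written as a short greedy check; a sketch:

```python
# Greedy subsequence test where each element is a set of items:
# sub is a subsequence of seq if sub's elements can be matched,
# in order, to (subsets of) elements of seq.
def is_subseq(sub, seq):
    i = 0
    for elem in seq:
        if i < len(sub) and sub[i] <= elem:
            i += 1
    return i == len(sub)

seq = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>
sub = [{"a"}, {"b", "c"}, {"d"}, {"c"}]                         # <a(bc)dc>
print(is_subseq(sub, seq))  # True
```

Matching each element as early as possible is safe here: taking an earlier match never rules out a later one.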
Challenges in Seq Pat Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should
 – Find the complete set of patterns satisfying the minimum support (frequency) threshold
 – Be highly efficient and scalable, involving only a small number of database scans
 – Be able to incorporate various kinds of user-specific constraints
Apriori Property of Seq Patterns
• Apriori property in sequential patterns
 – If a sequence S is infrequent, then none of the super-sequences of S is frequent
 – E.g., <hb> is infrequent → so are <hab> and <(ah)b>

Given support threshold min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
GSP
• GSP (Generalized Sequential Pattern) mining
• Outline of the method
 – Initially, every item in the DB is a length-1 candidate
 – For each level (i.e., sequences of length k):
   • Scan the database to collect the support count for each candidate sequence
   • Generate length-(k+1) candidate sequences from length-k frequent sequences using the Apriori property
 – Repeat until no frequent sequence or no candidate can be found
• Major strength: candidate pruning by the Apriori property
Finding Len-1 Seq Patterns
• Initial candidates
 – <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan the database once
 – Count support for each candidate
min_sup =2
Cand  Sup
<a>   3
<b>   5
<c>   4
<d>   3
<e>   3
<f>   2
<g>   1
<h>   1

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>
Generating Length-2 Candidates

     <a>   <b>   <c>   <d>   <e>   <f>
<a>  <aa>  <ab>  <ac>  <ad>  <ae>  <af>
<b>  <ba>  <bb>  <bc>  <bd>  <be>  <bf>
<c>  <ca>  <cb>  <cc>  <cd>  <ce>  <cf>
<d>  <da>  <db>  <dc>  <dd>  <de>  <df>
<e>  <ea>  <eb>  <ec>  <ed>  <ee>  <ef>
<f>  <fa>  <fb>  <fc>  <fd>  <fe>  <ff>

     <a>  <b>     <c>     <d>     <e>     <f>
<a>       <(ab)>  <(ac)>  <(ad)>  <(ae)>  <(af)>
<b>               <(bc)>  <(bd)>  <(be)>  <(bf)>
<c>                       <(cd)>  <(ce)>  <(cf)>
<d>                               <(de)>  <(df)>
<e>                                       <(ef)>
<f>

51 length-2 candidates

Without the Apriori property, 8×8 + 8×7/2 = 92 candidates; Apriori prunes 44.57% of the candidates
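The candidate counts above follow from simple counting: k frequent items give k² ordered candidates <xy> plus k(k-1)/2 unordered element candidates <(xy)>. A quick check:

```python
# Length-2 candidate counts: k items give k*k ordered candidates <xy>
# plus k*(k-1)//2 element candidates <(xy)>
def len2_candidates(k):
    return k * k + k * (k - 1) // 2

with_apriori = len2_candidates(6)     # only the 6 frequent items
without = len2_candidates(8)          # all 8 items
pruned = (without - with_apriori) / without * 100
print(with_apriori, without, round(pruned, 2))  # 51 92 44.57
```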
Finding Len-2 Seq Patterns
• Scan the database one more time to collect the support count for each length-2 candidate
• 19 length-2 candidates pass the minimum support threshold
 – They are the length-2 sequential patterns
Generating Length-3 Candidates and Finding Length-3 Patterns
• Generate length-3 candidates
 – Self-join length-2 sequential patterns
   • <ab>, <aa> and <ba> are all length-2 sequential patterns → <aba> is a length-3 candidate
   • <(bd)>, <bb> and <db> are all length-2 sequential patterns → <(bd)b> is a length-3 candidate
 – 46 candidates are generated
• Find Length-3 Sequential Patterns – Scan database once more, collect support
counts for candidates – 19 out of 46 candidates pass support threshold
The GSP Mining Process

min_sup = 2

Seq-id  Sequence
10      <(bd)cb(ac)>
20      <(bf)(ce)b(fg)>
30      <(ah)(bf)abf>
40      <(be)(ce)d>
50      <a(bd)bcb(ade)>

1st scan: 8 candidates (<a> <b> <c> <d> <e> <f> <g> <h>), 6 length-1 seq. patterns
2nd scan: 51 candidates (<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>), 19 length-2 seq. patterns, 10 candidates not in the DB at all
3rd scan: 46 candidates (<abb> <aab> <aba> <baa> <bab> …), 19 length-3 seq. patterns, 20 candidates not in the DB at all
4th scan: 8 candidates (<abba> <(bd)bc> …), 6 length-4 seq. patterns
5th scan: 1 candidate (<(bd)cba>), 1 length-5 seq. pattern

(The remaining candidates either cannot pass the support threshold or do not appear in the DB at all.)
The GSP Algorithm
• Take sequences of the form <x> as length-1 candidates
• Scan the database once to find F1, the set of length-1 sequential patterns
• Let k = 1; while Fk is not empty do
 – Form Ck+1, the set of length-(k+1) candidates, from Fk
 – If Ck+1 is not empty, scan the database once to find Fk+1, the set of length-(k+1) sequential patterns
 – Let k = k + 1
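The loop above can be sketched in code. This is a deliberately simplified GSP-style level-wise miner, not the full algorithm: it assumes one item per element (no parenthesized element sets), a tiny made-up database, and a naive join (extend each frequent sequence by every frequent item) instead of GSP's Fk-join-Fk candidate generation:

```python
# Simplified GSP-style level-wise sequential pattern miner (sketch).
db = [list("abcd"), list("abd"), list("acd"), list("bcd")]
min_sup = 2

def sup(cand):
    # cand is a subsequence of s if its items appear in s in order
    def occurs(s):
        it = iter(s)
        return all(x in it for x in cand)
    return sum(occurs(s) for s in db)

items = sorted({x for s in db for x in s})
F = [[i] for i in items if sup([i]) >= min_sup]
patterns = list(F)
while F:
    C = [f + [i] for f in F for i in items]     # length-(k+1) candidates
    F = [c for c in C if sup(c) >= min_sup]     # one DB "scan" per level
    patterns += F

result = ["".join(p) for p in patterns]
print(result)
```

On this toy database the miner finds 13 patterns, e.g. "abd" is frequent (support 2) while "abc" occurs in only one sequence and is pruned.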
Bottlenecks of GSP
• A huge set of candidates
 – 1,000 frequent length-1 sequences generate 1000 × 1000 + (1000 × 999)/2 = 1,499,500 length-2 candidates!
• Multiple scans of the database in mining
• Real challenge: mining long sequential patterns
 – An exponential number of short candidates
 – A length-100 sequential pattern needs Σ_{i=1..100} C(100, i) = 2^100 - 1 ≈ 10^30 candidate sequences!
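These counts can be verified directly:

```python
# Candidate blow-up numbers from this slide
n = 1000
len2 = n * n + n * (n - 1) // 2       # length-2 candidates from 1,000 items
short_cands = 2 ** 100 - 1            # nonempty subsequences of a length-100 pattern
print(len2)                           # 1499500
print(short_cands > 10 ** 30)         # True: about 1.27e30
```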
FreeSpan: Frequent-Pattern-projected Sequential Pattern Mining
• The itemset of a sequential pattern must be frequent
 – Recursively project a sequence database into a set of smaller databases based on the current set of frequent patterns
 – Mine each projected database to find its patterns

f_list: b:5, c:4, a:3, d:3, e:3, f:2

All sequential patterns can be divided into 6 subsets:
• Those containing item f
• Those containing e but no f
• Those containing d but no e nor f
• Those containing a but no d, e or f
• Those containing c but no a, d, e or f
• Those containing only item b

Sequence Database SDB
<(bd)cb(ac)>
<(bf)(ce)b(fg)>
<(ah)(bf)abf>
<(be)(ce)d>
<a(bd)bcb(ade)>
From FreeSpan to PrefixSpan
• FreeSpan:
 – Projection-based: no candidate sequence needs to be generated
 – But projection can be performed at any point in the sequence, and the projected sequences may not shrink much
• PrefixSpan:
 – Also projection-based
 – But only prefix-based projection: fewer projections and quickly shrinking sequences
Prefix and Suffix (Projection)
• Given sequence <a(abc)(ac)d(cf)>
• <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Prefix  Suffix (Prefix-Based Projection)
<a>     <(abc)(ac)d(cf)>
<aa>    <(_bc)(ac)d(cf)>
<ab>    <(_c)(ac)d(cf)>
Mining Sequential Patterns by Prefix Projections
• Step 1: find length-1 sequential patterns
 – <a>, <b>, <c>, <d>, <e>, <f>
• Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 subsets:
 – The ones having prefix <a>
 – The ones having prefix <b>
 – …
 – The ones having prefix <f>

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Finding Seq. Pat. with Prefix <a>
• Only need to consider projections w.r.t. <a>
 – <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
• Find all the length-2 sequential patterns having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>
 – Further partition into 6 subsets
   • Having prefix <aa>
   • …
   • Having prefix <af>
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>
Completeness of PrefixSpan
SDB
SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Length-1 sequential patterns: <a>, <b>, <c>, <d>, <e>, <f>

Having prefix <a> → <a>-projected database:
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>

Length-2 sequential patterns with prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>

Having prefix <aa> → <aa>-projected database; …; having prefix <af> → <af>-projected database

Having prefix <b> → <b>-projected database; …; similarly for prefixes <c>, …, <f>
Efficiency of PrefixSpan
• No candidate sequence needs to be generated
• Projected databases keep shrinking • Major cost of PrefixSpan: constructing
projected databases – Can be improved by bi-level projections
Effectiveness
• Redundancy due to anti-monotonicity
 – <abcd> alone leads to 15 sequential patterns of the same support
 – Remedies: closed sequential patterns and sequential generators
• Constraints on sequential patterns
 – Gap
 – Length
 – More sophisticated, application-oriented constraints
Data Warehousing & OLAP
Motivation: Business Intelligence
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (1) 143
Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
Product information (Product-id, category, manufacturer, made-in, stock-price, …)
Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
Business queries:
• Which categories of products are most popular for customers in Vancouver?
• Find pairs (customer group, most popular products)
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …
In what aspects is he most similar to cases of coronary artery disease and, at the same time, dissimilar to adiposity?
Don’t You Ever Google Yourself?
• Big data helps one know oneself better
• 57% of American adults search for themselves on the Internet
  – Good news: those people are better paid than those who haven’t done so! (Investors.com)
• Egocentric analysis becomes more and more important with big data
Egocentric Analysis
• How am I different from (more often than not, better than) others?
• In what aspects am I good?
http://img03.deviantart.net/a670/i/2010/219/a/e/glee___egocentric_by_gleeondoodles.jpg
Dimensions
• “An aspect or feature of a situation, problem, or thing; a measurable extent of some kind” – Dictionary
• Dimensions/attributes are used to model complex objects in a divide-and-conquer manner
  – Objects are compared in selected dimensions/attributes
• More often than not, objects have more dimensions/attributes than one is interested in or can handle
Multi-dimensional Analysis
• Find interesting patterns in multi-dimensional subspaces
  – “Michael Jordan is outstanding in subspaces (total points, total rebounds, total assists) and (number of games played, total points, total assists)”
• Different patterns may be manifested in different subspaces
  – Feature selection (machine learning and statistics): select a subset of relevant features for use in model construction – one set of features for all objects
  – Different subspaces may manifest different patterns
OLAP
• Conceptually, we may explore all possible subspaces for interesting patterns
• What patterns are interesting?
• How can we explore all possible subspaces systematically and efficiently?
• These are fundamental problems in analytics and data mining
OLAP
• Aggregates and group-bys are frequently used in data analysis and summarization
  SELECT time, altitude, AVG(temp)
  FROM weather
  GROUP BY time, altitude;
  – In TPC, 6 standard benchmarks have 83 queries; aggregates are used 59 times and group-bys 20 times
• Online analytical processing (OLAP): the techniques that answer multi-dimensional analytical (MDA) queries efficiently
OLAP Operations
• Roll up (drill up): summarize data by climbing up a hierarchy or by dimension reduction
  – (Day, Store, Product type, SUM(sales)) → (Month, City, *, SUM(sales))
• Drill down (roll down): the reverse of roll-up, from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
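As a sketch (our own toy example, not tied to any particular OLAP engine), rolling up (Day, Store) to (Month, City) just re-aggregates the measure under the coarser keys given the concept hierarchies:

```python
from collections import defaultdict

def roll_up(facts, day_to_month, store_to_city):
    """Re-aggregate (day, store, sales) facts at the (Month, City) level."""
    out = defaultdict(int)
    for day, store, sales in facts:
        # climb each dimension's hierarchy, then sum under the coarser key
        out[(day_to_month[day], store_to_city[store])] += sales
    return dict(out)

facts = [('Jan-03', 'S1', 10), ('Jan-17', 'S2', 5), ('Feb-02', 'S1', 7)]
day_to_month = {'Jan-03': 'Jan', 'Jan-17': 'Jan', 'Feb-02': 'Feb'}
store_to_city = {'S1': 'Vancouver', 'S2': 'Vancouver'}
roll_up(facts, day_to_month, store_to_city)
# -> {('Jan', 'Vancouver'): 15, ('Feb', 'Vancouver'): 7}
```

Drill-down is the inverse direction and cannot be computed from the rolled-up result alone; it requires the finer-grained data.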
Roll Up
http://www.tutorialspoint.com/dwh/images/rollup.jpg
Drill Down
http://www.tutorialspoint.com/dwh/images/drill_down.jpg
Other Operations
• Dice: pick specific values or ranges on some dimensions
• Pivot: “rotate” a cube – changing the order of dimensions in visual analysis
http://en.wikipedia.org/wiki/File:OLAP_pivoting.png
Dice
http://www.tutorialspoint.com/dwh/images/dice.jpg
Relational Representation
• If there are n dimensions, there are 2^n possible aggregation columns
• Roll up by model, by year, and by color in a table
Difficulties
• Many group-bys are needed
  – 6 dimensions → 2^6 = 64 group-bys
• In most SQL systems, the resulting query needs 64 scans of the data, 64 sorts or hashes, and a long wait!
Dummy Value “ALL”
CUBE
SALES
Model | Year | Color | Sales
Chevy | 1990 | red   |  5
Chevy | 1990 | white | 87
Chevy | 1990 | blue  | 62
Chevy | 1991 | red   | 54
Chevy | 1991 | white | 95
Chevy | 1991 | blue  | 49
Chevy | 1992 | red   | 31
Chevy | 1992 | white | 54
Chevy | 1992 | blue  | 71
Ford  | 1990 | red   | 64
Ford  | 1990 | white | 62
Ford  | 1990 | blue  | 63
Ford  | 1991 | red   | 52
Ford  | 1991 | white |  9
Ford  | 1991 | blue  | 55
Ford  | 1992 | red   | 27
Ford  | 1992 | white | 62
Ford  | 1992 | blue  | 39
CUBE ↓

DATA CUBE
Model | Year | Color | Sales
Chevy | 1990 | blue  |  62
Chevy | 1990 | red   |   5
Chevy | 1990 | white |  87
Chevy | 1990 | ALL   | 154
Chevy | 1991 | blue  |  49
Chevy | 1991 | red   |  54
Chevy | 1991 | white |  95
Chevy | 1991 | ALL   | 198
Chevy | 1992 | blue  |  71
Chevy | 1992 | red   |  31
Chevy | 1992 | white |  54
Chevy | 1992 | ALL   | 156
Chevy | ALL  | blue  | 182
Chevy | ALL  | red   |  90
Chevy | ALL  | white | 236
Chevy | ALL  | ALL   | 508
Ford  | 1990 | blue  |  63
Ford  | 1990 | red   |  64
Ford  | 1990 | white |  62
Ford  | 1990 | ALL   | 189
Ford  | 1991 | blue  |  55
Ford  | 1991 | red   |  52
Ford  | 1991 | white |   9
Ford  | 1991 | ALL   | 116
Ford  | 1992 | blue  |  39
Ford  | 1992 | red   |  27
Ford  | 1992 | white |  62
Ford  | 1992 | ALL   | 128
Ford  | ALL  | blue  | 157
Ford  | ALL  | red   | 143
Ford  | ALL  | white | 133
Ford  | ALL  | ALL   | 433
ALL   | 1990 | blue  | 125
ALL   | 1990 | red   |  69
ALL   | 1990 | white | 149
ALL   | 1990 | ALL   | 343
ALL   | 1991 | blue  | 104
ALL   | 1991 | red   | 106
ALL   | 1991 | white | 104
ALL   | 1991 | ALL   | 314
ALL   | 1992 | blue  | 110
ALL   | 1992 | red   |  58
ALL   | 1992 | white | 116
ALL   | 1992 | ALL   | 284
ALL   | ALL  | blue  | 339
ALL   | ALL  | red   | 233
ALL   | ALL  | white | 369
ALL   | ALL  | ALL   | 941
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE(Model, Year, Color);
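What GROUP BY CUBE computes can be sketched by brute force (our own illustration, not how a DBMS implements it): every tuple contributes to all 2^n combinations of (value, ALL) over the n dimensions.

```python
from itertools import combinations
from collections import defaultdict

def cube_sum(rows, dims, measure):
    """SUM(measure) for every group-by in the cube of dims ('ALL' = aggregated out)."""
    out = defaultdict(int)
    for r in rows:
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                # replace every dimension not in this group-by with the dummy ALL
                key = tuple(r[d] if d in kept else 'ALL' for d in dims)
                out[key] += r[measure]
    return dict(out)

rows = [
    {'Model': 'Chevy', 'Year': 1990, 'Color': 'red',   'Sales': 5},
    {'Model': 'Chevy', 'Year': 1990, 'Color': 'white', 'Sales': 87},
    {'Model': 'Ford',  'Year': 1990, 'Color': 'red',   'Sales': 64},
]
c = cube_sum(rows, ('Model', 'Year', 'Color'), 'Sales')
c[('ALL', 1990, 'red')]   # -> 69
c[('ALL', 'ALL', 'ALL')]  # -> 156
```

The cost of this naive approach grows as 2^n per tuple, which is exactly why efficient cube computation (discussed later) matters.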
Semantics of ALL
• ALL is a set
  – Model.ALL = ALL(Model) = {Chevy, Ford}
  – Year.ALL = ALL(Year) = {1990, 1991, 1992}
  – Color.ALL = ALL(Color) = {red, white, blue}
OLTP Versus OLAP

                   | OLTP                                                     | OLAP
users              | clerk, IT professional                                   | knowledge worker
function           | day-to-day operations                                    | decision support
DB design          | application-oriented                                     | subject-oriented
data               | current, up-to-date, detailed, flat relational, isolated | historical, summarized, multidimensional, integrated, consolidated
usage              | repetitive                                               | ad hoc
access             | read/write, index/hash on primary key                    | lots of scans
unit of work       | short, simple transaction                                | complex query
# records accessed | tens                                                     | millions
# users            | thousands                                                | hundreds
DB size            | 100 MB–GB                                                | 100 GB–TB
metric             | transaction throughput                                   | query throughput, response time
What Is a Data Warehouse?
• “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” – W. H. Inmon
• Data warehousing: the process of constructing and using data warehouses
Subject-Oriented
• Organized around major subjects, such as customer, product, sales
• Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
Integrated
• Integrating multiple, heterogeneous data sources
  – Relational databases, flat files, online transaction records
• Data cleaning and data integration
  – Ensuring consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources
    • E.g., hotel price: currency, tax, whether breakfast is covered, etc.
  – When data is moved to the warehouse, it is converted
Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems
  – Operational databases: current-value data
  – Data warehouse data: provide information from a historical perspective (e.g., the past 5–10 years)
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  – But the key of operational data may or may not contain a “time element”
Nonvolatile
• A physically separate store of data transformed from the operational environment
• Operational updates of data do not occur in the data warehouse environment
  – No need for transaction processing, recovery, or concurrency control mechanisms
  – Only two operations in data accessing:
    • Initial loading of data
    • Access of data
Why Separate Data Warehouse?
• High performance for both
  – Operational DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data
  – Historical data: data analysis often uses historical data that operational databases do not typically maintain
  – Data consolidation: data analysis requires consolidation (aggregation, summarization) of data from heterogeneous sources
Data Warehouse Schema Design
• Query answering efficiency
  – Subject orientation
  – Integration
• Tradeoff between time and space
  – Universal table versus fully normalized schema
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (2) 168
Star Schema
• Fact table: Sales (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension tables:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, state_or_province, country)
Snowflake Schema
• Fact table: Sales (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Dimension tables are further normalized:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_key) → supplier (supplier_key, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city_key) → city (city_key, city, state_or_province, country)
Fact Constellation
• Sales fact table (time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales)
• Shipping fact table (time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped)
• Shared dimension tables:
  – time (time_key, day, day_of_the_week, month, quarter, year)
  – item (item_key, item_name, brand, type, supplier_type)
  – branch (branch_key, branch_name, branch_type)
  – location (location_key, street, city, province_or_state, country)
  – shipper (shipper_key, shipper_name, location_key, shipper_type)
(Good) Aggregate Functions
• Distributive: there is a function G() such that F({X_{i,j}}) = G({F({X_{i,j} | i = 1, …, I_j}) | j = 1, …, n})
  – Examples: COUNT(), MIN(), MAX(), SUM()
  – G = SUM() for COUNT()
• Algebraic: there is an M-tuple-valued function G() and a function H() such that F({X_{i,j}}) = H({G({X_{i,j} | i = 1, …, I_j}) | j = 1, …, n})
  – Examples: AVG(), standard deviation, MaxN(), MinN()
  – For AVG(), G() records (sum, count); H() adds the component sums and counts, then divides to produce the global average
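The algebraic property of AVG() can be seen in a few lines (a sketch; the function names are ours): the sub-aggregate G keeps a fixed-size (sum, count) pair per partition, and H merges the pairs.

```python
def g_avg(partition):
    # G(): constant-size sub-aggregate for AVG, a (sum, count) pair
    return (sum(partition), len(partition))

def h_avg(sub_aggregates):
    # H(): merge the per-partition pairs into the global average
    total = sum(s for s, _ in sub_aggregates)
    count = sum(c for _, c in sub_aggregates)
    return total / count

h_avg([g_avg([1, 2, 3]), g_avg([4, 5])])  # -> 3.0, same as AVG over all 5 values
```

Averaging the per-partition averages directly would be wrong whenever partitions have different sizes, which is exactly why the (sum, count) pair is needed.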
Holistic Aggregate Functions
• There is no constant bound on the size of the storage needed to describe a sub-aggregate
  – There is no constant M such that an M-tuple characterizes the computation F({X_{i,j} | i = 1, …, I})
• Examples: Median(), MostFrequent() (also called Mode()), and Rank()
Index Requirements in OLAP
• Data is read-only
  – (Almost) no insertions or deletions
• Query types
  – Point query: looking up one specific tuple (rare)
  – Range query: returning the aggregate of a (large) set of tuples, with group-by
  – Complex queries: need specific algorithms and index structures, discussed later
OLAP Query Example
• In a table (cust, gender, …), find the total number of male customers
• Method 1: scan the table once
• Method 2: build a B+ tree index on attribute gender; we still need to access all tuples of male customers
• Can we get the count without scanning many tuples, not even all the tuples of male customers?
Bitmap Index
• For n tuples, a bitmap index has n bits, which can be packed into ⎡n/8⎤ bytes or ⎡n/32⎤ words
• From a bit to the row-id: the j-th bit of the p-th byte → row-id = p*8 + j

Example (bitmap for gender = M):
cust  | gender | bit
Jack  | M      | 1
Cathy | F      | 0
…     | …      | …
Nancy | F      | 0
Using Bitmap to Count
• shcount[] contains the number of 1-bits in its entry subscript
  – Example: shcount[01100101] = 4

  count = 0;
  for (i = 0; i < SHNUM; i++)
      count += shcount[B[i]];
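In Python, the same byte-table trick looks like this (a sketch; SHCOUNT mirrors the shcount[] array above, and the 10-tuple bitmap is our own toy data):

```python
# SHCOUNT[b] = number of 1-bits in byte value b, precomputed once
SHCOUNT = [bin(b).count('1') for b in range(256)]

def count_ones(bitmap):
    """Count set bits in a packed bitmap (a bytes/bytearray object)."""
    return sum(SHCOUNT[b] for b in bitmap)

# pack a 10-tuple bitmap (e.g., gender = M): j-th bit of p-th byte = row p*8 + j
bits = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
packed = bytearray()
for p in range(0, len(bits), 8):
    byte = 0
    for j, bit in enumerate(bits[p:p + 8]):
        byte |= bit << j
    packed.append(byte)

count_ones(packed)  # -> 6 male customers
```

The count is obtained from the packed bytes alone; no customer tuple is ever touched.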
Advantages of Bitmap Index
• Efficient in space
• Ready for logical composition
  – C = C1 AND C2
  – Bitmap operations can be used
• Bitmap indexes only work for categorical data with low cardinality
  – Naively, we need 50 bits per entry to represent the state of a customer in the US
  – How do we represent a sale amount in dollars?
Bit-Sliced Index
• A sale amount can be written as an integer number of pennies and then represented as a binary number of N bits
  – 24 bits cover up to $167,772.15, appropriate for many stores
• A bit-sliced index is N bitmaps
  – Tuple j sets its bit in bitmap k if the k-th bit of its binary representation is on
  – The space cost of a bit-sliced index is the same as storing the data directly
Using Indexes
SELECT SUM(sales) FROM Sales WHERE C;
– The tuples satisfying C are identified by a bitmap B
• Direct access to rows to calculate SUM: scan the whole table once
• B+ tree: find the tuples from the tree
• Projection index: scan only attribute sales
• Bit-sliced index: get the sum from Σ_k COUNT(B AND B_k) * 2^k
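The bit-sliced evaluation can be sketched as follows (our own illustration; the amounts are made-up pennies): bitmap B_k has bit j set when bit k of tuple j's amount is 1, and the selected sum is recovered from popcounts alone.

```python
def build_slices(amounts, nbits=24):
    """Bit-sliced index: slices[k] has bit j set iff bit k of amounts[j] is 1."""
    slices = [0] * nbits
    for j, a in enumerate(amounts):
        for k in range(nbits):
            if (a >> k) & 1:
                slices[k] |= 1 << j
    return slices

def sliced_sum(slices, b):
    """SUM of amounts over tuples selected by bitmap b: sum_k COUNT(b AND B_k) * 2^k."""
    return sum(bin(b & bk).count('1') << k for k, bk in enumerate(slices))

amounts = [599, 1250, 75, 3100]   # sale amounts in pennies
slices = build_slices(amounts)
b = 0b1011                        # predicate C selects tuples 0, 1, and 3
sliced_sum(slices, b)  # -> 4949 (= 599 + 1250 + 3100)
```

Only N bitmap ANDs and popcounts are performed, regardless of how many tuples the predicate selects.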
Cost Comparison
• Traditional value-list index (B+ tree) is costly in both I/O and CPU time
  – Not good for OLAP
• Bit-sliced index is efficient in I/O
• Other case studies in [O’Neil and Quass, SIGMOD’97]
Horizontal or Vertical Storage
• A fact table for data warehousing is often fat
  – Tens or even hundreds of dimensions/attributes
• A query is often about only a few attributes
• Horizontal storage: tuples are stored one by one
• Vertical storage: tuples are stored by attributes
[Figure: the same table with attributes A1, A2, …, A100 and tuples (x1, …, x100), …, (z1, …, z100), stored row by row (horizontal) and column by column (vertical)]
Horizontal Versus Vertical
• Find the information of tuple t
  – Typical in OLTP
  – Horizontal storage: get the whole tuple in one search
  – Vertical storage: search 100 lists
• Find SUM(a100) GROUP BY {a22, a83}
  – Typical in OLAP
  – Horizontal storage (no index): search all tuples, O(100n), where n is the number of tuples
  – Vertical storage: search 3 lists, O(3n), 3% of the cost of the horizontal method
• Projection index: vertical storage
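The 3-list claim can be illustrated with a toy column store (our own example, with made-up attribute values): the group-by touches only the three columns it names.

```python
from collections import defaultdict

# six tuples, stored horizontally (rows) and vertically (cols)
rows = [{'a22': r % 2, 'a83': r % 3, 'a100': r} for r in range(6)]
cols = {name: [row[name] for row in rows] for name in ('a22', 'a83', 'a100')}

# SUM(a100) GROUP BY a22, a83 reads just three lists in the vertical layout
out = defaultdict(int)
for g1, g2, v in zip(cols['a22'], cols['a83'], cols['a100']):
    out[(g1, g2)] += v

out[(1, 0)]  # -> 3 (only the tuple r=3 has a22=1 and a83=0)
```

A row store answering the same query must read every attribute of every tuple, even though 97 of the 100 attributes are irrelevant to it.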
Jian Pei: CMPT 741/459 Data Warehousing and OLAP (3) 184
MOLAP
[Figure: a 3-D array with dimensions Product (TV, VCR, PC), Date (1Qtr–4Qtr), and Country (U.S.A, Canada, Mexico), with sum cells along each dimension]
Pros and Cons
• Easy to implement • Fast retrieval • Many entries may be empty if data is sparse • Costly in space
ROLAP – Data Cube in Table
• A multi-dimensional database

Base table:
Store | Product | Season | Sales
S1    | P1      | Spring | 6
S1    | P2      | Spring | 12
S2    | P1      | Fall   | 9

Cubing produces:
Store | Product | Season | AVG(Sales)
S1    | P1      | Spring | 6
S1    | P2      | Spring | 12
S2    | P1      | Fall   | 9
S1    | *       | Spring | 9
…     | …       | …      | …
*     | *       | *      | 9
Data Cube: A Lattice of Cuboids
• 0-D (apex) cuboid: all
• 1-D cuboids: time, item, location, supplier
• 2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
• 3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
• 4-D (base) cuboid: (time, item, location, supplier)
Data Cube: A Lattice of Cuboids
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
  – E.g., (9/15, milk, Urbana, Dairy_land), (9/15, milk, Urbana, *), (*, milk, Urbana, *), (*, milk, Chicago, *), (*, milk, *, *)
Full Cube vs. Iceberg Cube
• Full cube vs. iceberg cube:

    compute cube sales_iceberg as
    select month, city, customer_group, count(*)
    from salesInfo
    cube by month, city, customer_group
    having count(*) >= min_support    -- the iceberg condition

• Avoid explosive growth: consider a cube with 100 dimensions and only 2 base cells, (a1, a2, …, a100) and (b1, b2, …, b100)
  – How many aggregate cells if "having count >= 1"?
  – What about "having count >= 2"?
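The explosive-growth question can be checked by brute force for small dimensionality, and the counts scale the same way as d grows. A sketch (names are illustrative) assuming the two base cells differ in every dimension:

```python
from itertools import product

def aggregate_cells(base_cells, min_count):
    """Brute-force the iceberg question for small d: count aggregate cells
    (at least one '*') whose COUNT over the base cells >= min_count."""
    d = len(base_cells[0])
    # Per dimension, a candidate cell takes a value seen in the data or '*'.
    domains = [sorted({cell[i] for cell in base_cells} | {"*"}) for i in range(d)]
    n = 0
    for cand in product(*domains):
        if "*" not in cand:
            continue  # base cells themselves are not aggregate cells
        covered = sum(
            all(c == "*" or c == v for c, v in zip(cand, cell))
            for cell in base_cells
        )
        if covered >= min_count:
            n += 1
    return n

# Two fully distinct base cells in d = 3 dimensions: 2*(2^3 - 1) - 1 = 13
# aggregate cells pass "count >= 1", but only the apex (*, *, *) passes
# "count >= 2" because the base cells disagree in every dimension.
base = [("a1", "a2", "a3"), ("b1", "b2", "b3")]
print(aggregate_cells(base, 1))  # 13
print(aggregate_cells(base, 2))  # 1
```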
Multi-Way Array Aggregation
• Array-based "bottom-up" algorithm
  – Uses multi-dimensional chunks
  – No direct tuple comparisons
  – Simultaneous aggregation on multiple dimensions
  – Intermediate aggregate values are re-used for computing ancestor cuboids
  – Cannot do Apriori pruning: no iceberg optimization
[Figure: the cuboid lattice All → A, B, C → AB, AC, BC → ABC.]
Multi-way Array Aggregation for Cube Computation (MOLAP)
• Partition arrays into chunks (a small subcube that fits in memory)
• Compressed sparse array addressing: (chunk_id, offset)
• Compute aggregates in "multi-way" by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost
What is the best traversing order to do multi-way aggregation?
[Figure: a 3-D array on dimensions A (a0–a3), B (b0–b3), and C (c0–c3), partitioned into 64 chunks numbered 1–64.]
Multi-way Array Aggregation for Cube Computation (3-D to 2-D)
• The best order is the one that minimizes the memory requirement and reduces I/O

[Figure: aggregating the 3-D cuboid ABC down to the 2-D cuboids AB, AC, and BC in the lattice All → A, B, C → AB, AC, BC → ABC.]
Multi-way Array Aggregation for Cube Computation (2-D to 1-D)
[Figure: aggregating the 2-D cuboids AB, AC, and BC down to the 1-D cuboids A, B, and C in the same lattice.]
Multi-Way Array Aggregation for Cube Computation

• Method: the planes should be sorted and computed according to their size in ascending order
  – Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
• Limitation: the method works well only for a small number of dimensions
  – With many dimensions, "top-down" computation and iceberg cube computation methods can be explored
Iceberg Cube
• In a data cube, many aggregate cells are trivial
  – Their aggregate value is too small to be interesting
• Iceberg query
Monotonic Iceberg Condition
• If COUNT(a, b, *)<100, then COUNT(a, b, c)<100 for any c
• For cells c1 and c2, c1 is called an ancestor of c2 if, in every dimension where c1 takes a non-* value, c2 agrees with c1
  – (a, b, *) is an ancestor of (a, b, c)
• An iceberg condition P is monotonic if, for any aggregate cell c failing P, no descendant of c can satisfy P
BUC
• Once a base table (A, B, C) is sorted by A-B-C, aggregates (*,*,*), (A,*,*), (A,B,*) and (A,B,C) can be computed with one scan and 4 counters
• To compute other aggregates, we can sort the base table in some other orders
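The one-scan idea can be sketched as follows: while scanning the sorted table, one counter per prefix length is maintained (plus one for the grand total) and flushed whenever its prefix changes. This is a hypothetical helper illustrating the counting trick, not the full BUC algorithm:

```python
def prefix_counts(rows):
    """One scan over rows sorted by (A, B, C): emit COUNT for every
    prefix aggregate (*,*,*), (A,*,*), (A,B,*) and (A,B,C)."""
    rows = sorted(rows)  # sort by A-B-C
    out = {}
    total = 0
    counters = [0, 0, 0]       # counts for the current A-, AB-, ABC-prefix
    prev = None
    for row in rows + [None]:  # a None sentinel flushes the last prefixes
        if prev is not None:
            for depth in (3, 2, 1):  # flush each counter whose prefix ended
                if row is None or row[:depth] != prev[:depth]:
                    key = prev[:depth] + ("*",) * (3 - depth)
                    out[key] = counters[depth - 1]
                    counters[depth - 1] = 0
        if row is not None:
            total += 1
            for depth in range(3):
                counters[depth] += 1
        prev = row
    out[("*", "*", "*")] = total
    return out

rows = [("a", "x", 1), ("a", "x", 1), ("a", "y", 2), ("b", "x", 1)]
counts = prefix_counts(rows)
print(counts[("a", "*", "*")], counts[("a", "x", "*")], counts[("*", "*", "*")])  # 3 2 4
```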
Example
  Location    Year   Color    Amount
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
  Seattle     2015   Green    120
  Vancouver   2014   Green    260
  Seattle     2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
Iceberg threshold: sum(Amount) >= 300
Example: Sorting on Location
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2015   Yellow   300
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
  Vancouver   2014   Green    260
  Victoria    2014   Red      400
Sum(Seattle, *, *) = 280 ✗
Sum(Vancouver, *, *) = 1000 ✓
Sum(Victoria, *, *) = 400 ✓
Sorting on Year for Vancouver
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2014   Green    260
  Vancouver   2015   Yellow   300
  Vancouver   2015   Red      160
  Victoria    2014   Red      400
Sum(Vancouver, 2014, *) = 540 ✓
Sum(Vancouver, 2015, *) = 460 ✓
Color on Vancouver & 2014/2015
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Green    260
  Vancouver   2014   Yellow   280
  Vancouver   2015   Red      160
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
Sum(Vancouver, 2014, Yellow) = 280 ✗
Sum(Vancouver, 2014, Green) = 260 ✗
Sum(Vancouver, 2015, Yellow) = 300 ✓
Sum(Vancouver, 2015, Red) = 160 ✗
Sort on Color for Vancouver
  Location    Year   Color    Amount
  Seattle     2015   Green    120
  Seattle     2015   Red      160
  Vancouver   2014   Green    260
  Vancouver   2015   Red      160
  Vancouver   2014   Yellow   280
  Vancouver   2015   Yellow   300
  Victoria    2014   Red      400
Sum(Vancouver, *, Green) = 260 ✗
Sum(Vancouver, *, Red) = 160 ✗
Sum(Vancouver, *, Yellow) = 580 ✓
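The pruning step of this example can be reproduced with a simple group-and-sum pass: Seattle's total falls below the threshold, so BUC never examines any (Seattle, Year, Color) cell. A sketch with illustrative names:

```python
from collections import defaultdict

# The sales table from the example: (Location, Year, Color, Amount).
rows = [
    ("Vancouver", 2015, "Yellow", 300), ("Victoria", 2014, "Red", 400),
    ("Seattle", 2015, "Green", 120),    ("Vancouver", 2014, "Green", 260),
    ("Seattle", 2015, "Red", 160),      ("Vancouver", 2014, "Yellow", 280),
    ("Vancouver", 2015, "Red", 160),
]

def group_sum(rows, dims):
    """SUM(Amount) grouped by the dimension indexes in `dims`."""
    sums = defaultdict(int)
    for row in rows:
        sums[tuple(row[d] for d in dims)] += row[3]
    return dict(sums)

by_loc = group_sum(rows, [0])
# Seattle (280 < 300) is pruned; BUC recurses only into Vancouver and Victoria.
survivors = {k for k, v in by_loc.items() if v >= 300}
print(sorted(survivors))  # [('Vancouver',), ('Victoria',)]
```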
How to Sort the Base Table?
• General sorting in main memory: O(n log n)
• Counting in main memory: O(n), linear in the number of tuples in the base table
  – How to sort 1 million integers in the range 1 to 100?
  – Set up 100 counters, initialized to 0
  – Scan the integers once, counting the occurrences of each value in 1 to 100
  – Scan the integers again, putting each integer in its right place
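The counting idea above is counting sort; a minimal sketch:

```python
def counting_sort(values, lo=1, hi=100):
    """Sort integers in [lo, hi] in O(n + hi - lo): count occurrences
    in one scan, then write the values back out in order."""
    counts = [0] * (hi - lo + 1)
    for v in values:
        counts[v - lo] += 1
    out = []
    for offset, c in enumerate(counts):
        out.extend([lo + offset] * c)
    return out

print(counting_sort([42, 7, 100, 7, 1]))  # [1, 7, 7, 42, 100]
```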
Pushing Monotonic Conditions
• BUC searches the aggregates bottom-up in a depth-first manner
• The descendants of the current node are expanded only when the monotonic condition holds at the node
Clustering
Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Customer Relation Management
• Partition customers into groups such that customers within a group are similar in some aspects
• A manager can be assigned to each group
• Customized products and services can be developed for each group
What Is Clustering?
• Group data into clusters
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
  – Unsupervised learning: no predefined classes
[Figure: two clusters of points, Cluster 1 and Cluster 2, with a few outliers.]
Requirements of Clustering
• Scalability
• Ability to deal with various types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
Data Matrix
• For memory-based clustering – Also called object-by-variable structure
• Represents n objects with p variables (attributes, measures) – A relational table
  \[ \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \]
Dissimilarity Matrix
• For memory-based clustering – Also called object-by-object structure – Proximities of pairs of objects – d(i, j): dissimilarity between objects i and j – Nonnegative – Close to 0: similar
  \[ \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \]
How Good Is Clustering?
• Dissimilarity/similarity depends on distance function – Different applications have different functions
• Judgment of clustering quality is typically highly subjective
Types of Data in Clustering
• Interval-scaled variables
• Binary variables
• Nominal, ordinal, and ratio variables
• Variables of mixed types
Interval-valued Variables
• Continuous measurements on a roughly linear scale
  – Weight, height, latitude and longitude coordinates, temperature, etc.
• Effect of measurement units on attributes
  – Smaller unit → larger variable range → larger effect on the result
  – Remedy: standardization + background knowledge
Standardization
• Calculate the mean absolute deviation

  \[ s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right), \qquad m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right) \]

• Calculate the standardized measurement (z-score)

  \[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

• The mean absolute deviation is more robust than the standard deviation
  – The effect of outliers is reduced but remains detectable
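A small sketch of this standardization (the function name is illustrative):

```python
def standardize(values):
    """z-scores using the mean absolute deviation s_f, which is more
    robust to outliers than the standard deviation."""
    n = len(values)
    m = sum(values) / n                          # m_f: the mean
    s = sum(abs(x - m) for x in values) / n      # s_f: mean absolute deviation
    return [(x - m) / s for x in values]

# Mean 5.0, mean absolute deviation (3 + 1 + 1 + 3) / 4 = 2.0.
print(standardize([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```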
Similarity and Dissimilarity
• Distances are the most commonly used measures
• Minkowski distance: a generalization

  \[ d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0) \]

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• If q = ∞, d is the Chebyshev distance
• Weighted distance

  \[ d(i,j) = \left(w_1|x_{i1} - x_{j1}|^q + w_2|x_{i2} - x_{j2}|^q + \cdots + w_p|x_{ip} - x_{jp}|^q\right)^{1/q} \quad (q > 0) \]
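The three special cases can be verified with a short function (illustrative name):

```python
def minkowski(x, y, q):
    """Minkowski distance: q=1 gives Manhattan, q=2 Euclidean,
    q=float('inf') Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if q == float("inf"):
        return max(diffs)
    return sum(d ** q for d in diffs) ** (1 / q)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))             # 7.0  (Manhattan)
print(minkowski(x, y, 2))             # 5.0  (Euclidean)
print(minkowski(x, y, float("inf")))  # 4    (Chebyshev)
```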
Manhattan and Chebyshev Distance
Manhattan distance (picture from Wikipedia)
Chebyshev distance (http://brainking.com/images/rules/chess/02.gif)
In two dimensions, the Chebyshev distance is the chessboard distance
Properties of Minkowski Distance
• Nonnegative: d(i,j) ≥ 0
• The distance of an object to itself is 0
  – d(i,i) = 0
• Symmetric: d(i,j) = d(j,i)
• Triangle inequality
  – d(i,j) ≤ d(i,k) + d(k,j)
Binary Variables
• A contingency table for binary data:

                  Object j
                1      0      Sum
  Object i  1   q      r      q+r
            0   s      t      s+t
          Sum  q+s    r+t      p

• Symmetric variable: each state carries the same weight
  – Invariant similarity: \( d(i,j) = \frac{r + s}{q + r + s + t} \)
• Asymmetric variable: the positive value carries more weight
  – Noninvariant similarity (Jaccard): \( d(i,j) = \frac{r + s}{q + r + s} \)
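A sketch computing both dissimilarities from two 0/1 vectors (illustrative name; q, r, s, t are the contingency counts as defined in the table):

```python
def binary_dissimilarity(i, j, asymmetric=False):
    """Dissimilarity of two binary vectors via the contingency counts
    q (1/1), r (1/0), s (0/1), t (0/0). Symmetric: (r+s)/(q+r+s+t);
    asymmetric (Jaccard-style): the 0/0 matches t are dropped."""
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))
    denom = q + r + s if asymmetric else q + r + s + t
    return (r + s) / denom

i = [1, 0, 1, 0, 0, 0]
j = [1, 1, 0, 0, 0, 0]
# q=1, r=1, s=1, t=3 for these vectors.
print(binary_dissimilarity(i, j))                   # 2/6 ≈ 0.333
print(binary_dissimilarity(i, j, asymmetric=True))  # 2/3 ≈ 0.667
```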
Nominal Variables
• A generalization of the binary variable: it can take more than 2 states, e.g., red, yellow, blue, green
• Method 1: simple matching
  – m: # of matches, p: total # of variables
  – \( d(i,j) = \frac{p - m}{p} \)
• Method 2: use a large number of binary variables
  – Create a new binary variable for each of the M nominal states
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled variables
  – Replace x_if by its rank \( r_{if} \in \{1, \dots, M_f\} \)
  – Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    \( z_{if} = \frac{r_{if} - 1}{M_f - 1} \)
  – Compute the dissimilarity using methods for interval-scaled variables
Ratio-scaled Variables
• Ratio-scaled variable: a positive measurement on a nonlinear scale
  – E.g., approximately exponential scale, such as Ae^{Bt}
• Treat them like interval-scaled variables?
  – Not a good choice: the scale can be distorted!
• Apply a logarithmic transformation, y_if = log(x_if)
• Or treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types
• A database may contain all six types of variables
  – Symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
• One may use a weighted formula to combine their effects:

  \[ d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
Clustering Methods
• K-means and partitioning methods
• Hierarchical clustering
• Density-based clustering
• Grid-based clustering
• Pattern-based clustering
• Other clustering methods
Partitioning Algorithms: Ideas
• Partition n objects into k clusters
  – Optimize the chosen partitioning criterion
• Global optimum: examine all possible partitions
  – (k^n − (k−1)^n − … − 1) possible partitions, too expensive!
• Heuristic methods: k-means and k-medoids
  – K-means: each cluster is represented by its center
  – K-medoids or PAM (partition around medoids): each cluster is represented by one of the objects in the cluster
K-means
• Arbitrarily choose k objects as the initial cluster centers
• Until no change, do
  – (Re)assign each object to the cluster whose center it is most similar to, based on the mean value of the objects in the cluster
  – Update the cluster means, i.e., calculate the mean value of the objects in each cluster
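The loop above can be sketched in a few lines. This is an illustrative, unoptimized implementation (ties and empty clusters are handled naively):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means sketch: pick k points as initial centers, then
    alternate assignment and mean update until assignments stabilize."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = None
    for _ in range(iters):
        new_assign = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_assign == assign:
            break  # no object changed cluster: converged
        assign = new_assign
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assign

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assign = kmeans(pts, 2)
# The two well-separated groups end up in different clusters.
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```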
K-Means: Example
[Figure: k-means with K = 2 on a 2-D point set (axes 0–10). Arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and repeat until no change.]
Pros and Cons of K-means
• Relatively efficient: O(tkn)
  – n: # objects, k: # clusters, t: # iterations; normally k, t ≪ n
• Often terminates at a local optimum
• Applicable only when the mean is defined
  – What about categorical data?
• Need to specify the number of clusters k
• Unable to handle noisy data and outliers
• Unsuitable for discovering non-convex clusters
Variations of the K-means

• Aspects of variation
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies to calculate cluster means
• Handling categorical data: k-modes
  – Use the mode instead of the mean (mode: the most frequent item(s))
  – For a mixture of categorical and numerical data: the k-prototype method
• EM (expectation maximization): assigns each object to a cluster with a probability (discussed later)
A Problem of K-means
• Sensitive to outliers
  – Outlier: objects with extremely large values may substantially distort the distribution of the data
• K-medoids: use the most centrally located object in a cluster instead of the mean
PAM: A K-medoids Method
• PAM: Partitioning Around Medoids
• Arbitrarily choose k objects as the initial medoids
• Until no change, do
  – (Re)assign each object to the cluster of its nearest medoid
  – Randomly select a non-medoid object o' and compute the total cost S of swapping a medoid o with o'
  – If S < 0, swap o with o' to form the new set of k medoids
Swapping Cost
• Measure whether o’ is better than o as a medoid
• Use the squared-error criterion

  \[ E = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, o_i)^2 \]

• Compute E_{o'} − E_o
  – Negative: swapping brings benefit
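A sketch of the swapping cost, using squared Euclidean distance as d² (the names, points, and medoids are illustrative):

```python
def squared_error(points, medoids):
    """E = sum over clusters of sum of d(p, o_i)^2, assigning each
    point to its nearest medoid (squared Euclidean used as d^2)."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, m)) for m in medoids)
    return total

pts = [(0, 0), (0, 1), (0, 5), (9, 9)]
e_before = squared_error(pts, [(0, 5), (9, 9)])  # medoids o = (0, 5), (9, 9)
e_after = squared_error(pts, [(0, 1), (9, 9)])   # swap o = (0, 5) for o' = (0, 1)
# Swapping cost E_o' - E_o; a negative value means the swap helps.
print(e_before, e_after, e_after - e_before)  # 41.0 17.0 -24.0
```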
PAM: Example
[Figure: PAM with K = 2 on a 2-D point set (axes 0–10). Arbitrarily choose k objects as the initial medoids and assign each remaining object to its nearest medoid (total cost = 20). Randomly select a non-medoid object O_random and compute the total cost of swapping a medoid O with it (total cost = 26); if quality improves, swap O and O_random. Loop until no change.]
Pros and Cons of PAM
• PAM is more robust than k-means in the presence of noise and outliers
  – Medoids are less influenced by outliers
• PAM is efficient for small data sets but does not scale well to large data sets
  – O(k(n−k)²) per iteration
Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural way of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects into a hierarchy
  – No or very little supervision
  – Some heuristic guidance on the quality of the hierarchy
Hierarchical Clustering
• Group data objects into a tree of clusters
• Top-down versus bottom-up
[Figure: hierarchical clustering of objects a, b, c, d, e — agglomerative (AGNES) merges bottom-up from Step 0 to Step 4; divisive (DIANA) splits top-down from Step 4 to Step 0]
AGNES (Agglomerative Nesting)
• Initially, each object is a cluster
• Step-by-step cluster merging, until all objects form a cluster
• Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters
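The merging loop above can be sketched in a few lines. This is a minimal illustration, not the original AGNES code: the Manhattan distance and all names are assumptions.

```python
def single_link_agnes(points):
    """Agglomerative clustering, single-link: repeatedly merge the two
    clusters whose closest pair of points (across clusters) is nearest.
    Returns the merge history (cluster1, cluster2, merge distance)."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single-link: distance of the closest cross-cluster pair
                d = min(abs(p[0] - q[0]) + abs(p[1] - q[1])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The merge history is exactly the information a dendrogram displays.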
Dendrogram
• Shows how clusters are merged hierarchically
• Decompose data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cutting the dendrogram at the desired level – Each connected component forms a cluster
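Cutting a single-link dendrogram at a threshold is equivalent to taking the connected components of the "closer than threshold" graph. A small union-find sketch of that equivalence (Manhattan distance and all names are illustrative assumptions):

```python
def cut_single_link(points, threshold):
    """Clusters obtained by cutting a single-link dendrogram at
    `threshold`: connected components of the graph that links every
    pair of points within `threshold` of each other."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            d = abs(points[i][0] - points[j][0]) + abs(points[i][1] - points[j][1])
            if d <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```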
DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Step-by-step splitting of clusters until each cluster contains only one object
[Figure: three 10×10 scatter plots showing DIANA progressively splitting one cluster into smaller ones]
Distance Measures
• Minimum distance
• Maximum distance
• Mean distance
• Average distance
d_min(C_i, C_j) = min_{p∈C_i, q∈C_j} d(p, q)
d_max(C_i, C_j) = max_{p∈C_i, q∈C_j} d(p, q)
d_mean(C_i, C_j) = d(m_i, m_j)
d_avg(C_i, C_j) = (1 / (n_i n_j)) Σ_{p∈C_i} Σ_{q∈C_j} d(p, q)

m: the mean of a cluster; C: a cluster; n: the number of objects in a cluster
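The four inter-cluster distances can be computed side by side. A sketch using Euclidean distance on 2-D points (the function name and dict keys are illustrative assumptions):

```python
def cluster_distances(Ci, Cj):
    """Minimum, maximum, mean, and average distances between clusters
    Ci and Cj, following the four definitions above."""
    d = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    pairs = [d(p, q) for p in Ci for q in Cj]
    mean = lambda C: (sum(p[0] for p in C) / len(C),
                      sum(p[1] for p in C) / len(C))
    return {
        "min": min(pairs),                        # closest cross pair
        "max": max(pairs),                        # farthest cross pair
        "mean": d(mean(Ci), mean(Cj)),            # distance between means
        "avg": sum(pairs) / (len(Ci) * len(Cj)),  # average over all pairs
    }
```

Note that the mean distance (between centroids) and the average distance (over all cross pairs) generally differ.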
Challenges
• Hard to choose merge/split points
  – Can never undo a merge/split
  – Merging/splitting decisions are critical
• High complexity: O(n²)
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
  – Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature: CF = (N, LS, SS)
N: the number of data points
LS: Σ_{i=1}^N o_i (linear sum of the points)
SS: Σ_{i=1}^N o_i² (square sum of the points)

Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give
CF = (5, (16,30), (54,190))
[Figure: the five points plotted on a 10×10 grid]
Clustering Feature Vector
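Computing a CF vector and exercising the additivity property from the next slide takes only a few lines; this sketch assumes 2-D points as tuples, and the function names are illustrative:

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, and
    per-dimension sum of squares."""
    N = len(points)
    LS = tuple(sum(p[d] for p in points) for d in range(2))
    SS = tuple(sum(p[d] ** 2 for p in points) for d in range(2))
    return (N, LS, SS)

def cf_add(cf1, cf2):
    """Additivity: merging two clusters just adds their CF entries,
    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))
```

Additivity is what lets BIRCH maintain CF entries incrementally while scanning the data once.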
CF-tree in BIRCH
• Clustering features
  – Summarize the statistics for a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived
  – Additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A nonleaf node in the tree has descendants or “children”
  – The nonleaf nodes store sums of the CFs of their children
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6 — the root and nonleaf nodes hold entries (CF_i, child_i); leaf nodes hold CF entries and are chained by prev/next pointers]
Parameters of a CF-tree
• Branching factor: the maximum number of children
• Threshold: max diameter of sub-clusters stored at the leaf nodes
BIRCH Clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records
Distance-based Methods: Drawbacks
• Hard to find clusters with irregular shapes
• Hard to specify the number of clusters
• Heuristic: a cluster must be dense
How to Find Irregular Clusters?
• Divide the whole space into many small areas
  – The density of an area can be estimated
  – Areas may or may not be exclusive
  – A dense area is likely in a cluster
• Start from a dense area, traverse connected dense areas, and discover clusters of irregular shape
Directly Density Reachable
• Parameters
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
  – N_Eps(p) = {q | dist(p, q) ≤ Eps}
• Core object p: |N_Eps(p)| ≥ MinPts
  – A core object is in a dense area
• Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: q in the Eps-neighborhood of core object p; MinPts = 3, Eps = 1 cm]
Density-Based Clustering
• Density-reachable
  – A chain of directly density-reachable points: p1 → p2, p2 → p3, …, p(n−1) → pn
  – pn is density-reachable from p1
• Density-connected
  – If points p and q are both density-reachable from o, then p and q are density-connected
[Figure: p density-reachable from q via intermediate point p1; p and q density-connected through o]
DBSCAN
• A cluster: a maximal set of density-connected points
  – Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5]
DBSCAN: the Algorithm
• Arbitrarily select a point p
• Retrieve all points density-reachable from p wrt Eps and MinPts
• If p is a core point, a cluster is formed
• If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
• Continue the process until all of the points have been processed
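The algorithm above can be sketched compactly. This is a minimal illustration (Euclidean distance, a naive O(n²) neighborhood query, and all names are assumptions, not the original formulation's code):

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns a cluster id per point,
    or -1 for noise."""
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    neighbors = lambda i: [j for j, q in enumerate(points)
                           if dist(points[i], q) <= eps]
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core point: noise, for now
            labels[i] = -1
            continue
        cluster += 1               # i is a core point: start a cluster
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:    # border point previously marked noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts: # j is also a core point: expand
                queue.extend(nj)
    return labels
```

Points first marked noise may later be re-labeled as border points of a cluster, matching the core/border/outlier picture above.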
Challenges for DBSCAN
• Different clusters may have very different densities
• Clusters may be in hierarchies
Biclustering
• Clustering both objects and attributes simultaneously
• Four requirements
  – Only a small set of objects in a cluster (bicluster)
  – A bicluster only involves a small number of attributes
  – An object may participate in multiple biclusters, or in no bicluster
  – An attribute may be involved in multiple biclusters, or in no bicluster
Application Examples
• Recommender systems
  – Objects: users; attributes: items; values: user ratings
• Microarray data
  – Objects: genes; attributes: samples; values: expression levels
[Figure: an n × m gene × sample/condition matrix W = (w_ij)]
Biclusters with Constant Values
        b6   b12  b36  b99
a1      60   60   60   60
a33     60   60   60   60
a86     60   60   60   60
(other rows and columns elided)

Figure 11.5: A gene-condition matrix, a submatrix, and a bi-cluster.
subset of products. For example, AllElectronics is highly interested in finding a group of customers who all like the same group of products. Such a cluster is a submatrix in the customer-product matrix, where all elements have a high value. Using such a cluster, AllElectronics can make recommendations in two directions. First, the company can recommend products to new customers who are similar to the customers in the cluster. Second, the company can recommend to customers new products that are similar to those involved in the cluster.

As with bi-clusters in a gene expression data matrix, the bi-clusters in a customer-product matrix usually have the following characteristics:

• Only a small set of customers participate in a cluster;

• A cluster involves only a small subset of products;

• A customer can participate in multiple clusters, or may not participate in any cluster at all; and

• A product may be involved in multiple clusters, or may not be involved in any cluster at all.

Bi-clustering can be applied to customer-product matrices to mine clusters satisfying the above requirements.

Types of Bi-clusters

“How can we model bi-clusters and mine them?” Let’s start with some basic notation. For the sake of simplicity, we’ll use “genes” and “conditions” to refer to the two dimensions in our discussion. Our discussion can easily be extended to other applications. For example, we can simply replace “genes” and “conditions” by “customers” and “products” to tackle the customer-product bi-clustering problem.

Let A = {a1, . . . , an} be a set of genes and B = {b1, . . . , bm} be a set of conditions. Let E = [eij] be a gene expression data matrix, that is, a gene-condition matrix, where 1 ≤ i ≤ n and 1 ≤ j ≤ m. A submatrix I × J is
10 10 10 10 10
20 20 20 20 20
50 50 50 50 50
 0  0  0  0  0

Figure 11.6: A bi-cluster with constant values on rows.
10 50 30  70 20
20 60 40  80 30
50 90 70 110 60
 0 40 20  60 10

Figure 11.7: A bi-cluster with coherent values.
defined by a subset I ⊆ A of genes and a subset J ⊆ B of conditions. For example, in the matrix shown in Figure 11.5, {a1, a33, a86} × {b6, b12, b36, b99} is a submatrix.

A bi-cluster is a submatrix where genes and conditions follow consistent patterns. We can define different types of bi-clusters based on such patterns:

• As the simplest case, a submatrix I × J (I ⊆ A, J ⊆ B) is a bi-cluster with constant values if for any i ∈ I and j ∈ J, eij = c, where c is a constant. For example, the submatrix {a1, a33, a86} × {b6, b12, b36, b99} in Figure 11.5 is a bi-cluster with constant values.

• A bi-cluster is interesting if each row has a constant value, though different rows may have different values. A bi-cluster with constant values on rows is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + αi, where αi is the adjustment for row i. For example, Figure 11.6 shows a bi-cluster with constant values on rows.

Symmetrically, a bi-cluster with constant values on columns is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + βj, where βj is the adjustment for column j.

• More generally, a bi-cluster is interesting if the rows change in a synchronized way with respect to the columns and vice versa. Mathematically, a bi-cluster with coherent values (also known as a pattern-based cluster) is a submatrix I × J such that for any i ∈ I and j ∈ J, eij = c + αi + βj, where αi and βj are the adjustments for row i and column j, respectively. For example, Figure 11.7 shows a bi-cluster with coherent values.

It can be shown that I × J is a bi-cluster with coherent values if and only if for any i1, i2 ∈ I and j1, j2 ∈ J, ei1j1 − ei2j1 = ei1j2 − ei2j2. Moreover, instead of using addition, we can define bi-clusters with coherent
On rows
Biclusters with Coherent Values
• Also known as pattern-based clusters
Biclusters with Coherent Evolutions
• Only up- or down-regulated changes over rows or columns
10  50 30   70 20
20 100 50 1000 30
50 100 90  120 80
 0  80 20  100 10

Figure 11.8: A bi-cluster with coherent evolutions on rows.
values using multiplication, that is, eij = c · αi · βj. Clearly, bi-clusters with constant values on rows or columns are special cases of bi-clusters with coherent values.

• In some applications, we may only be interested in the up- or down-regulated changes across genes or conditions without constraining the exact values. A bi-cluster with coherent evolutions on rows is a submatrix I × J such that for any i1, i2 ∈ I and j1, j2 ∈ J, (ei1j1 − ei1j2)(ei2j1 − ei2j2) ≥ 0. For example, Figure 11.8 shows a bi-cluster with coherent evolutions on rows. Symmetrically, we can define bi-clusters with coherent evolutions on columns.
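The two conditions — the 2×2 difference test for coherent values and the sign test for coherent evolutions on rows — translate directly into code. A brute-force sketch (function names are illustrative) that can be checked against Figures 11.7 and 11.8:

```python
def is_coherent_values(M):
    """eij = c + alpha_i + beta_j holds iff every 2x2 submatrix
    satisfies e[i1][j1] - e[i2][j1] == e[i1][j2] - e[i2][j2]."""
    rows, cols = len(M), len(M[0])
    return all(M[i1][j1] - M[i2][j1] == M[i1][j2] - M[i2][j2]
               for i1 in range(rows) for i2 in range(rows)
               for j1 in range(cols) for j2 in range(cols))

def is_coherent_evolution_rows(M):
    """Rows rise and fall together: for every pair of rows and pair of
    columns, the product of the column differences is non-negative."""
    rows, cols = len(M), len(M[0])
    return all((M[i1][j1] - M[i1][j2]) * (M[i2][j1] - M[i2][j2]) >= 0
               for i1 in range(rows) for i2 in range(rows)
               for j1 in range(cols) for j2 in range(cols))
```

The matrix of Figure 11.7 passes both tests; the matrix of Figure 11.8 passes only the evolution test, since its rows preserve order but not differences.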
Next, we study how to mine bi-clusters.
Bi-clustering Methods
The above specification of the types of bi-clusters only considers ideal cases. In real data sets, such perfect bi-clusters rarely exist. When they do exist, they are usually very small. Instead, random noise can affect the readings of eij and thus prevent a bi-cluster in nature from appearing in a perfect shape.

There are two major types of methods for discovering bi-clusters in data that may come with noise. Optimization-based methods conduct an iterative search. At each iteration, the submatrix with the highest significance score is identified as a bi-cluster. The process terminates when a user-specified condition is met. Due to cost concerns in computation, greedy search is often employed to find locally optimal bi-clusters. Enumeration methods use a tolerance threshold to specify the degree of noise allowed in the bi-clusters to be mined, and then try to enumerate all submatrices of bi-clusters that satisfy the requirements. We use the δ-Cluster and MaPle algorithms as examples to illustrate these ideas.
Optimization Using the δ-Cluster Algorithm
For a submatrix I × J, the mean of the i-th row is

    e_iJ = (1 / |J|) Σ_{j∈J} e_ij.    (11.16)
Coherent evolutions on rows
Differences from Subspace Clustering
• Subspace clustering uses a global distance/similarity measure
• Pattern-based clustering looks at patterns
• A subspace cluster according to a globally defined similarity measure may not follow the same pattern
Objects Follow the Same Pattern?
[Figure: two objects (blue and green) plotted on attributes D1 and D2, illustrating pScore]
The smaller the pScore, the more consistently the two objects follow the same pattern.
Pattern-based Clusters
• pScore: the similarity between two objects rx, ry on two attributes au, av:

    pScore([[rx.au, rx.av], [ry.au, ry.av]]) = |(rx.au − ry.au) − (rx.av − ry.av)|

• δ-pCluster (R, D): for any objects rx, ry ∈ R and any attributes au, av ∈ D,

    pScore([[rx.au, rx.av], [ry.au, ry.av]]) ≤ δ   (δ ≥ 0)
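The pScore and the δ-pCluster check are straightforward to implement; a sketch representing objects as attribute-to-value dicts (the representation and names are illustrative assumptions):

```python
def p_score(rx, ry, au, av):
    """pScore of objects rx, ry on attributes au, av:
    |(rx[au] - ry[au]) - (rx[av] - ry[av])|."""
    return abs((rx[au] - ry[au]) - (rx[av] - ry[av]))

def is_delta_pcluster(objects, attrs, delta):
    """(R, D) is a delta-pCluster if every pair of objects, on every
    pair of attributes, has pScore <= delta."""
    return all(p_score(rx, ry, au, av) <= delta
               for rx in objects for ry in objects
               for au in attrs for av in attrs)
```

A pScore of 0 means the two objects shift by exactly the same amount between the two attributes.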
Maximal pCluster
• If (R, D) is a δ-pCluster, then every sub-cluster (R′, D′) with R′ ⊆ R and D′ ⊆ D is also a δ-pCluster
  – An anti-monotonic property
  – A large pCluster is accompanied by many small pClusters, so enumerating them all is inefficient
• Idea: mine only the maximal pClusters
  – A δ-pCluster is maximal if no proper super-cluster of it is a δ-pCluster
Mining Maximal pClusters
• Given
  – A cluster threshold δ
  – An attribute threshold min_a
  – An object threshold min_o
• Task: mine the complete set of significant maximal δ-pClusters
  – A significant δ-pCluster has at least min_o objects on at least min_a attributes
Grid-based Clustering Methods
• Ideas
  – Use multi-resolution grid data structures
  – Use dense grid cells to form clusters
• Several interesting methods
  – CLIQUE, STING, WaveCluster
CLIQUE
• CLIQUE: Clustering In QUEst
• Automatically identifies subspaces of a high-dimensional data space
• Both density-based and grid-based
CLIQUE: the Ideas
• Partition each dimension into the same number of equal-length intervals
  – This partitions an m-dimensional data space into non-overlapping rectangular units
• A unit is dense if the number of data points in the unit exceeds a threshold
• A cluster is a maximal set of connected dense units within a subspace
CLIQUE: the Method
• Partition the data space and find the number of points in each cell of the partition
  – Apriori: a k-d cell cannot be dense if one of its (k−1)-d projections is not dense
• Identify clusters
  – Determine dense units and connected dense units in all subspaces of interest
• Generate a minimal description for the clusters
  – Determine the minimal cover for each cluster
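The first step — counting points per grid cell and pruning 2-D candidates with the Apriori property — can be sketched as follows. This is an illustrative fragment, not the CLIQUE implementation: the uniform grid over [0, span) per dimension and all names are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, n_intervals, span, threshold):
    """CLIQUE-style sketch: dense 1-D intervals per dimension, then
    2-D candidates restricted by Apriori — a 2-D cell can only be
    dense if both of its 1-D projections are dense."""
    width = span / n_intervals
    cell = lambda v: min(int(v // width), n_intervals - 1)
    dims = range(len(points[0]))
    # dense 1-D units: intervals holding more than `threshold` points
    dense1 = {d: {u for u, c in Counter(cell(p[d]) for p in points).items()
                  if c > threshold}
              for d in dims}
    # candidate 2-D units: both 1-D projections must already be dense
    dense2 = {}
    for d1, d2 in combinations(dims, 2):
        counts = Counter((cell(p[d1]), cell(p[d2])) for p in points
                         if cell(p[d1]) in dense1[d1]
                         and cell(p[d2]) in dense1[d2])
        dense2[(d1, d2)] = {u for u, c in counts.items() if c > threshold}
    return dense1, dense2
```

The same pruning generalizes from 2-D candidates to k-d candidates built from dense (k−1)-d units.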
CLIQUE: An Example
[Figure: dense units in the (age, salary) and (age, vacation) subspaces — salary (×$10,000) and vacation (weeks) plotted against age (20–60) — intersected to identify a candidate cluster in the (age, salary, vacation) space]
CLIQUE: Pros and Cons
• Automatically finds subspaces of the highest dimensionality with high-density clusters
• Insensitive to the order of input
  – Does not presume any canonical data distribution
• Scales linearly with the size of input
• Scales well with the number of dimensions
• The quality of the clustering result may be degraded as the price of the method’s simplicity
Bad Cases for CLIQUE
Parts of a cluster may be missed
A cluster from CLIQUE may contain noise
Fuzzy Clustering
• Each point x_i takes a probability w_ij to belong to cluster C_j
• Requirements
  – For each point x_i: Σ_{j=1}^k w_ij = 1
  – For each cluster C_j: 0 < Σ_{i=1}^m w_ij < m
Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign values to all the w_ij
Repeat
    Compute the centroid of each cluster using the fuzzy pseudo-partition
    Recompute the fuzzy pseudo-partition, i.e., the w_ij
Until the centroids do not change (or the change is below some threshold)
Critical Details
• Optimize the sum of the squared error (SSE):

    SSE(C_1, …, C_k) = Σ_{j=1}^k Σ_{i=1}^m w_ij^p dist(x_i, c_j)²

• Computing centroids:

    c_j = Σ_{i=1}^m w_ij^p x_i / Σ_{i=1}^m w_ij^p

• Updating the fuzzy pseudo-partition:

    w_ij = (1 / dist(x_i, c_j)²)^{1/(p−1)} / Σ_{q=1}^k (1 / dist(x_i, c_q)²)^{1/(p−1)}

  – When p = 2:

    w_ij = (1 / dist(x_i, c_j)²) / Σ_{q=1}^k (1 / dist(x_i, c_q)²)
Choice of P
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
Effectiveness
Is a Clustering Good?
• Feasibility
  – Applying any clustering method to a uniformly distributed data set is meaningless
• Quality
  – Do the clustering results meet the users’ interest?
  – Clustering patients into clusters corresponding to various diseases or sub-phenotypes is meaningful
  – Clustering patients into clusters corresponding to male or female is not meaningful
Major Tasks
• Assessing clustering tendency
  – Are there non-random structures in the data?
• Determining the number of clusters or other critical parameters
• Measuring clustering quality
Uniformly Distributed Data
• Clustering uniformly distributed data is meaningless
• A uniformly distributed data set is generated by a uniform data distribution
Chapter 10. Cluster Analysis: Basic Concepts and Methods
Figure 10.21: A data set that is uniformly distributed in the data space.
• Measuring clustering quality. After applying a clustering method on a data set, we want to assess how good the resulting clusters are. A number of measures can be used. Some methods measure how well the clusters fit the data set, while others measure how well the clusters match the ground truth, if such truth is available. There are also measures that score clusterings and thus can compare two sets of clustering results on the same data set.

In the rest of this section, we discuss each of the above three topics.

10.6.1 Assessing Clustering Tendency

Clustering tendency assessment determines whether a given data set has a non-random structure, which may lead to meaningful clusters. Consider a data set that does not have any non-random structure, such as a set of uniformly distributed points in a data space. Even though a clustering algorithm may return clusters for the data, those clusters are random and are not meaningful.

Example 10.9 Clustering requires non-uniform distribution of data. Figure 10.21 shows a data set that is uniformly distributed in 2-dimensional data space. Although a clustering algorithm may still artificially partition the points into groups, the groups will unlikely mean anything significant to the application due to the uniform distribution of the data.

“How can we assess the clustering tendency of a data set?” Intuitively, we can try to measure the probability that the data set is generated by a uniform data distribution. This can be achieved using statistical tests for spatial randomness. To illustrate this idea, let’s look at a simple yet effective statistic called the Hopkins Statistic.

The Hopkins Statistic is a spatial statistic that tests the spatial randomness of a variable as distributed in a space. Given a data set, D, which is regarded as a sample of a random variable, o, we want to determine how far away o is from being uniformly distributed in the data space. We calculate the Hopkins Statistic as follows:
Hopkins Statistic
• Hypothesis: the data is generated by a uniform distribution in a space
• Sample n points, p1, …, pn, uniformly from the space of D
• For each point pi, find the nearest neighbor of pi in D, let xi be the distance between pi and its nearest neighbor in D
x_i = min_{v ∈ D} {dist(p_i, v)}
Hopkins Statistic
• Sample n points, q1, …, qn, uniformly from D
• For each qi, find the nearest neighbor of qi in D – {qi}; let yi be the distance between qi and its nearest neighbor in D – {qi}
• Calculate the Hopkins Statistic H
y_i = min_{v ∈ D, v ≠ q_i} {dist(q_i, v)}

H = Σ_{i=1}^n y_i / (Σ_{i=1}^n x_i + Σ_{i=1}^n y_i)
Explanation
• If D is uniformly distributed, then Σ_{i=1}^n y_i and Σ_{i=1}^n x_i would be close to each other, and thus H would be around 0.5
• If D is skewed, then Σ_{i=1}^n y_i would be substantially smaller than Σ_{i=1}^n x_i, and thus H would be close to 0
• If H > 0.5, then it is unlikely that D has statistically significant clusters
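The two sampling steps defining H can be sketched in plain Python (a minimal illustration for 2-d points; the function name `hopkins` and the toy data are illustrative assumptions, not from the slides):

```python
import math
import random

def hopkins(D, n, seed=0):
    """Estimate the Hopkins Statistic of a 2-d data set D (list of (x, y) tuples).

    n points p_i are sampled uniformly from the bounding box of D, and n points
    q_i are sampled from D itself; x_i and y_i are the respective nearest-
    neighbour distances, and H = sum(y) / (sum(x) + sum(y)).
    """
    rng = random.Random(seed)
    xs = [p[0] for p in D]
    ys = [p[1] for p in D]

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # x_i: distance from a uniformly sampled point to its nearest neighbour in D
    x_sum = 0.0
    for _ in range(n):
        p = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        x_sum += min(dist(p, v) for v in D)

    # y_i: distance from a sampled data point to its nearest other point in D
    y_sum = 0.0
    for q in rng.sample(D, n):
        y_sum += min(dist(q, v) for v in D if v != q)

    return y_sum / (x_sum + y_sum)
```

On strongly clustered data H comes out near 0, and on uniform data near 0.5, matching the interpretation above.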
Finding the Number of Clusters
• Depending on many factors
  – The shape and scale of the distribution in the data set
  – The clustering resolution required by the user
• Many methods exist
  – Set k = √(n/2), so that each cluster has √(2n) points on average
  – Plot the sum of within-cluster variances with respect to k, and find the first (or the most significant) turning point
A Cross-Validation Method
• Divide the data set D into m parts
• Use m – 1 parts to find a clustering
• Use the remaining part as the test set to test the quality of the clustering
  – For each point in the test set, find the closest centroid or cluster center
  – Use the squared distances between all points in the test set and the corresponding centroids to measure how well the clustering model fits the test set
• Repeat m times for each value of k, and use the average as the quality measure
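The m-fold procedure above can be sketched with a toy 1-d k-means (Lloyd's algorithm with a deterministic initialization; all names and data here are illustrative assumptions, not part of the slides):

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 1-d points; returns k centroids.
    Initial centroids are evenly spaced quantiles, so runs are deterministic."""
    srt = sorted(points)
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centroids[j]))
            buckets[nearest].append(p)
        centroids = [sum(b) / len(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def cv_score(points, k, m=5):
    """m-fold cross-validation score for a candidate k: average total squared
    distance from held-out points to their closest training centroid
    (smaller means the clustering model fits held-out data better)."""
    folds = [points[i::m] for i in range(m)]
    total = 0.0
    for i in range(m):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centroids = kmeans(train, k)
        total += sum(min((p - c) ** 2 for c in centroids) for p in folds[i])
    return total / m
```

On data with two well-separated groups, `cv_score(points, 2)` comes out far below `cv_score(points, 1)`, which is how the right k is recognized.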
Measuring Clustering Quality
• Ground truth: the ideal clustering determined by human experts
• Two situations – There is a known ground truth – the extrinsic
(supervised) methods, comparing the clustering against the ground truth
– The ground truth is unavailable – the intrinsic (unsupervised) methods, measuring how well the clusters are separated
Quality in Extrinsic Methods
• Cluster homogeneity: the purer the clusters in a clustering, the better the clustering
• Cluster completeness: objects in the same cluster in the ground truth should be clustered together
• Rag bag: putting a heterogeneous object into a pure cluster is worse than putting it into a rag bag
• Small cluster preservation: splitting a small cluster in the ground truth into pieces is worse than splitting a bigger one
BCubed Precision and Recall

• D = {o1, …, on}
  – L(oi) is the category of oi given by the ground truth
• C is a clustering on D
  – C(oi) is the cluster-id of oi in C
• For two objects oi and oj, the correctness is 1 if L(oi) = L(oj) ⇔ C(oi) = C(oj), and 0 otherwise
BCubed Precision and Recall
• Precision
• Recall
one, denoted by o, belong to the same category according to ground truth. Consider a clustering C2 identical to C1 except that o is assigned to a cluster C′ ≠ C in C2 such that C′ contains objects from various categories according to ground truth, and thus is noisy. In other words, C′ in C2 is a rag bag. Then, a clustering quality measure Q respecting the rag bag criterion should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

• Small cluster preservation. If a small category is split into small pieces in a clustering, those small pieces may likely become noise and thus the small category cannot be discovered from the clustering. The small cluster preservation criterion states that splitting a small category into pieces is more harmful than splitting a large category into pieces. Consider an extreme case. Let D be a data set of n + 2 objects such that, according to the ground truth, n objects, denoted by o1, . . . , on, belong to one category and the other 2 objects, denoted by on+1, on+2, belong to another category. Suppose clustering C1 has three clusters, C1 = {o1, . . . , on}, C2 = {on+1}, and C3 = {on+2}. Let clustering C2 have three clusters, too, namely C1 = {o1, . . . , on−1}, C2 = {on}, and C3 = {on+1, on+2}. In other words, C1 splits the small category and C2 splits the big category. A clustering quality measure Q preserving small clusters should give a higher score to C2, that is, Q(C2, Cg) > Q(C1, Cg).

Many clustering quality measures satisfy some of the above four criteria. Here, we introduce the BCubed precision and recall metrics, which satisfy all of the above criteria.

BCubed evaluates the precision and recall for every object in a clustering on a given data set according to the ground truth. The precision of an object indicates how many other objects in the same cluster belong to the same category as the object. The recall of an object reflects how many objects of the same category are assigned to the same cluster.

Formally, let D = {o1, . . . , on} be a set of objects, and C be a clustering on D. Let L(oi) (1 ≤ i ≤ n) be the category of oi given by ground truth, and C(oi) be the cluster ID of oi in C. Then, for two objects, oi and oj (1 ≤ i, j ≤ n, i ≠ j), the correctness of the relation between oi and oj in clustering C is given by
Correctness(o_i, o_j) = 1 if L(o_i) = L(o_j) ⇔ C(o_i) = C(o_j), and 0 otherwise.   (10.28)

BCubed precision is defined as

Precision BCubed = (1/n) Σ_{i=1}^n [ Σ_{o_j: i ≠ j, C(o_i) = C(o_j)} Correctness(o_i, o_j) / ‖{o_j | i ≠ j, C(o_i) = C(o_j)}‖ ].   (10.29)
BCubed recall is defined as

Recall BCubed = (1/n) Σ_{i=1}^n [ Σ_{o_j: i ≠ j, L(o_i) = L(o_j)} Correctness(o_i, o_j) / ‖{o_j | i ≠ j, L(o_i) = L(o_j)}‖ ].   (10.30)
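The two formulas above transcribe almost directly into code (a pure-Python sketch; the helper name `bcubed` is illustrative):

```python
def bcubed(L, C):
    """BCubed precision and recall of a clustering.

    L[i] is the ground-truth category of object i, C[i] its cluster id.
    Correctness(i, j) = 1 iff (L[i] == L[j]) <=> (C[i] == C[j]).
    """
    n = len(L)

    def correctness(i, j):
        return 1.0 if (L[i] == L[j]) == (C[i] == C[j]) else 0.0

    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if j != i and C[i] == C[j]]
        same_category = [j for j in range(n) if j != i and L[i] == L[j]]
        if same_cluster:
            precision += sum(correctness(i, j) for j in same_cluster) / len(same_cluster)
        if same_category:
            recall += sum(correctness(i, j) for j in same_category) / len(same_category)
    return precision / n, recall / n
```

A perfect clustering scores (1, 1); merging all objects into one cluster keeps recall at 1 but drives precision down, illustrating the trade-off.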
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact the clusters are. Many intrinsic methods take advantage of a similarity metric between objects in the data set.

The silhouette coefficient is such a measure. For a data set D of n objects, suppose D is partitioned into k clusters, C1, . . . , Ck. For each object o ∈ D, we calculate a(o) as the average distance between o and all other objects in the cluster to which o belongs. Similarly, b(o) is the minimum average distance from o to all clusters to which o does not belong. Formally, suppose o ∈ Ci (1 ≤ i ≤ k); then
a(o) = Σ_{o′ ∈ Ci, o′ ≠ o} dist(o, o′) / (|Ci| − 1)   (10.31)

and

b(o) = min_{Cj: 1 ≤ j ≤ k, j ≠ i} { Σ_{o′ ∈ Cj} dist(o, o′) / |Cj| }.   (10.32)

The silhouette coefficient of o is then defined as

s(o) = (b(o) − a(o)) / max{a(o), b(o)}.   (10.33)
The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value is, the more compact the cluster is. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (that is, b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad case, and should be avoided.

To measure the fitness of a cluster within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures
Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned into k clusters, C1, …, Ck
• For each object o,
  – Calculate a(o), the average distance between o and every other object in the same cluster – compactness of a cluster; the smaller, the better
  – Calculate b(o), the minimum average distance from o to the objects in each cluster that o does not belong to – degree of separation from other clusters; the larger, the better
Silhouette Coefficient
• Then
• Use the average silhouette coefficient of all objects as the overall measure
a(o) = Σ_{o′ ∈ Ci, o′ ≠ o} dist(o, o′) / (|Ci| − 1)

b(o) = min_{Cj: o ∉ Cj} { Σ_{o′ ∈ Cj} dist(o, o′) / |Cj| }

s(o) = (b(o) − a(o)) / max{a(o), b(o)}
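The three formulas above transcribe directly (a sketch for 1-d points with dist(o, o′) = |o − o′|; it assumes at least two clusters, each with at least two members):

```python
def silhouette(clusters):
    """Average silhouette coefficient; clusters is a list of lists of
    1-d points."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for oi, o in enumerate(cluster):
            # a(o): average distance from o to the other members of its cluster
            a = (sum(abs(o - p) for k, p in enumerate(cluster) if k != oi)
                 / (len(cluster) - 1))
            # b(o): minimum average distance from o to each other cluster
            b = min(sum(abs(o - p) for p in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

For two tight, well-separated clusters the average silhouette is close to 1, the preferable case described above.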
Classification
Jian Pei: CMPT 741/459 Classification (1) 293
Classification and Prediction
• Classification: predict categorical class labels
  – Build a model for a set of classes/concepts
  – E.g., classify loan applications (approve/decline)
• Prediction: model continuous-valued functions
  – E.g., predict the economic growth in 2015
Classification: A 2-step Process
• Model construction: describe a set of predetermined classes
  – Training dataset: tuples for model construction
    • Each tuple/sample belongs to a predefined class
  – The model is expressed as classification rules, decision trees, or math formulae
• Model application: classify unseen objects
  – Estimate the accuracy of the model using an independent test set
  – Acceptable accuracy → apply the model to classify tuples with unknown class labels
Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training data:

Name | Rank       | Years | Tenured
Mike | Ass. Prof  | 3     | No
Mary | Ass. Prof  | 7     | Yes
Bill | Prof       | 2     | Yes
Jim  | Asso. Prof | 7     | Yes
Dave | Ass. Prof  | 6     | No
Anne | Asso. Prof | 3     | No

Learned classifier (model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Application
Classifier → Testing Data → Unseen Data

Testing data:

Name    | Rank       | Years | Tenured
Tom     | Ass. Prof  | 2     | No
Merlisa | Asso. Prof | 7     | No
George  | Prof       | 5     | Yes
Joseph  | Ass. Prof  | 7     | Yes

Unseen data: (Jeff, Professor, 4) → Tenured?
Supervised/Unsupervised Learning
• Supervised learning (classification)
  – Supervision: objects in the training data set have labels
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Data Preparation
• Data cleaning
  – Preprocess data in order to reduce noise and handle missing values
• Relevance analysis (feature selection)
  – Remove the irrelevant or redundant attributes
• Data transformation
  – Generalize and/or normalize data
Measurements of Quality
• Prediction accuracy
• Speed and scalability
  – Construction speed and application speed
• Robustness: handling noise and missing values
• Scalability: building the model for large training data sets
• Interpretability: understandability of models
Decision Tree Induction
• Decision tree representation • Construction of a decision tree • Inductive bias and overfitting • Scalable enhancements for large databases
Decision Tree
• A node in the tree: a test of some attribute
• A branch: a possible value of the attribute
• Classification
  – Start at the root
  – Test the attribute
  – Move down the tree branch

Example tree for PlayTennis:

Outlook
├─ Sunny → Humidity
│   ├─ High → No
│   └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
    ├─ Strong → No
    └─ Weak → Yes
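The root-to-leaf classification walk can be sketched with nested dicts (the representation is an illustrative assumption, not the slides' notation):

```python
# The PlayTennis tree as nested dicts: an internal node maps its test
# attribute to branches, and a leaf is the class label "Yes"/"No".
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

def classify(node, example):
    """Walk from the root, testing one attribute per level, until a leaf."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[example[attribute]]
    return node
```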
Training Dataset

Outlook  | Temp | Humid  | Wind   | PlayTennis
Sunny    | Hot  | High   | Weak   | No
Sunny    | Hot  | High   | Strong | No
Overcast | Hot  | High   | Weak   | Yes
Rain     | Mild | High   | Weak   | Yes
Rain     | Cool | Normal | Weak   | Yes
Rain     | Cool | Normal | Strong | No
Overcast | Cool | Normal | Strong | Yes
Sunny    | Mild | High   | Weak   | No
Sunny    | Cool | Normal | Weak   | Yes
Rain     | Mild | Normal | Weak   | Yes
Sunny    | Mild | Normal | Strong | Yes
Overcast | Mild | High   | Strong | Yes
Overcast | Hot  | Normal | Weak   | Yes
Rain     | Mild | High   | Strong | No
Appropriate Problems
• Instances are represented by attribute-value pairs
  – Extensions of decision trees can handle real-valued attributes
• Disjunctive descriptions may be required
• The training data may contain errors or missing values
Basic Algorithm ID3
• Construct the tree in a top-down, recursive, divide-and-conquer manner
  – Which attribute is the best at the current node?
  – Create a node for each possible attribute value
  – Partition the training data into descendant nodes
• Conditions for stopping recursion
  – All samples at a given node belong to the same class
  – No attributes remain for further partitioning
    • Majority voting is employed for classifying the leaf
  – There are no samples at the node
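A compact sketch of the recursion above, using information gain (defined on the following slides) as the attribute selector and majority vote at exhausted leaves; the helper names and the tiny test data are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, a):
    """Expected entropy reduction from partitioning on attribute a."""
    total = entropy(labels)
    for v in set(r[a] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[a] == v]
        total -= len(sub) / len(labels) * entropy(sub)
    return total

def id3(rows, labels, attributes):
    """Top-down recursive divide-and-conquer tree construction."""
    if len(set(labels)) == 1:            # all samples in one class
        return labels[0]
    if not attributes:                   # no attribute left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    tree = {best: {}}
    # one branch per value observed at this node, so no empty branches arise
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = id3([rows[i] for i in idx],
                            [labels[i] for i in idx],
                            [a for a in attributes if a != best])
    return tree
```

On a toy set where attribute X perfectly predicts the label and Y is noise, the sketch picks X at the root and produces pure leaves.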
Which Attribute Is the Best?
• The attribute most useful for classifying examples
• Information gain and Gini index
  – Statistical properties
  – Measure how well an attribute separates the training examples
Entropy
• Measures the homogeneity of examples

  Entropy(S) ≡ −Σ_{i=1}^c p_i log_2 p_i

  – S is the training data set, and p_i is the proportion of S belonging to class i
• The smaller the entropy, the purer the data set
Information Gain
• The expected reduction in entropy caused by partitioning the examples according to an attribute

  Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

  where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v
Example

Using the PlayTennis training data above (9 Yes, 5 No):

Entropy(S) = −(9/14) log_2(9/14) − (5/14) log_2(5/14) = 0.94

Gain(S, Wind) = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
             = 0.94 − (8/14) × Entropy(S_Weak) − (6/14) × Entropy(S_Strong)
             = 0.94 − (8/14) × 0.811 − (6/14) × 1.00
             = 0.048
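The arithmetic in this example can be checked directly (a short sketch; the helper `entropy2` takes two-class Yes/No counts and is my own naming):

```python
import math

def entropy2(pos, neg):
    """Entropy of a two-class sample given its Yes/No counts."""
    e, total = 0.0, pos + neg
    for c in (pos, neg):
        if c:
            p = c / total
            e -= p * math.log2(p)
    return e

e_s = entropy2(9, 5)   # Entropy(S) for 9 Yes / 5 No, approx. 0.94
# Wind splits S into Weak (6 Yes, 2 No) and Strong (3 Yes, 3 No)
g_wind = e_s - 8 / 14 * entropy2(6, 2) - 6 / 14 * entropy2(3, 3)
```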
Hypothesis Space Search in Decision Tree Building • Hypothesis space: the set of possible
decision trees • ID3: simple-to-complex, hill-climbing search
– Evaluation function: information gain
Capabilities and Limitations
• The hypothesis space is complete
• Maintains only a single current hypothesis
• No backtracking
  – May converge to a locally optimal solution
• Uses all training examples at each step
  – Makes statistics-based decisions
  – Not sensitive to errors in individual examples
Natural Bias
• The information gain measure favors attributes with many values
• An extreme example
  – Attribute “date” may have the highest information gain
  – It yields a very broad decision tree of depth one
  – Such a tree is inapplicable to any future data
Alternative Measures
• Gain ratio: penalize attributes like date by incorporating split information

  SplitInformation(S, A) ≡ −Σ_{i=1}^c (|S_i| / |S|) log_2(|S_i| / |S|)

• Split information is sensitive to how broadly and uniformly the attribute splits the data

  GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

• Gain ratio can be undefined or very large
  – Only test attributes with above-average gain
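The two formulas above, plus the date-like extreme case, can be sketched directly (function names are mine):

```python
import math

def split_information(sizes):
    """SplitInformation(S, A) from the sizes |S_i| of the partition of S
    induced by attribute A."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

def gain_ratio(gain, sizes):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    return gain / split_information(sizes)
```

For Wind (subset sizes 8 and 6), SplitInformation is about 0.985, while a date-like attribute splitting 14 examples into 14 singletons has SplitInformation = log2(14), about 3.81, which sharply shrinks its gain ratio.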
Measuring Inequality
Lorenz Curve
  X-axis: quintiles
  Y-axis: cumulative share of income earned up to the plotted quintile
  Gap between the actual curve and the line of perfect equality: the degree of inequality

Gini index
  Gini = 0: perfectly even distribution
  Gini = 1: perfectly unequal distribution
  The greater the gap, the more unequal the distribution
Gini Index (Adjusted)
• A data set S contains examples from n classes

  gini(S) = 1 − Σ_{j=1}^n p_j^2

  – p_j is the relative frequency of class j in S
• A data set S is split into two subsets S1 and S2 with sizes N1 and N2 respectively

  gini_split(S) = (N1 / N) gini(S1) + (N2 / N) gini(S2)

• The attribute providing the smallest gini_split(S) is chosen to split the node
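The two definitions transcribe directly (a sketch; per-class counts stand in for the relative frequencies p_j):

```python
def gini(counts):
    """Gini index of a data set from its per-class counts (p_j = c / n)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini of a binary split into subsets of sizes N1 and N2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)
```

On the PlayTennis counts, gini([9, 5]) is about 0.459, a pure node scores 0, and the Weak/Strong split on Wind scores about 0.429, a small improvement, mirroring its small information gain.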
Extracting Classification Rules
• Classification rules can be extracted from a decision tree
• Each path from the root to a leaf → an IF-THEN rule
  – All attribute-value pairs along a path form a conjunctive condition
  – The leaf node holds the class prediction
  – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
• Rules are easy to understand
Inductive Bias
• The set of assumptions that, together with the training data, deductively justifies the classifications assigned to future instances
  – Preferences of the classifier construction
    • Shorter trees are preferred over longer trees
    • Trees that place high-information-gain attributes close to the root are preferred
Why Prefer Short Trees?
• Occam’s razor: prefer the simplest hypothesis that fits the data
  – “One should not increase, beyond what is necessary, the number of entities required to explain anything”
  – Also known as the principle of parsimony
• There are fewer short trees than long trees
• A short tree is less likely to be a statistical coincidence
Overfitting
• A decision tree T may overfit the training data
  – if there exists an alternative tree T’ such that T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
• Why overfitting?
  – Noisy data
  – Bias in the training data
Jian Pei: CMPT 741/459 Classification (2) 319
The Evaluation Issues
• The accuracy of a classifier can be evaluated using a test data set – The test set is a part of the available labeled
data set • But how can we evaluate the accuracy of a
classification method? – A classification method can generate many
classifiers • What if the available labeled data set is too
small?
Holdout Method
• Partition the available labeled data set into two disjoint subsets: the training set and the test set – 50-50 – 2/3 for training and 1/3 for testing
• Build a classifier using the training set • Evaluate the accuracy using the test set
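The holdout split above can be sketched in a few lines of Python (a minimal illustration on a toy labeled data set; `holdout_split` is a hypothetical helper, not from the slides):

```python
import random

def holdout_split(data, train_frac=2/3, seed=0):
    """Randomly partition labeled data into disjoint training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = [(x, x % 2) for x in range(30)]   # toy (feature, label) pairs
train, test = holdout_split(data)        # 2/3 for training, 1/3 for testing
```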
Limitations of Holdout Method
• Fewer labeled examples for training • The classifier depends heavily on the
composition of the training and test sets – The smaller the training set, the larger the
variance • If the test set is too small, the evaluation is
not reliable • The training and test sets are not
independent
Cross-Validation
• Each record is used the same number of times for training and exactly once for testing
• K-fold cross-validation – Partition the data into k equal-sized subsets – In each round, use one subset as the test set, and use
the remaining k − 1 subsets together as the training set – Repeat k times – The total error is the sum of the errors over the k rounds
• Leave-one-out: k = n – Utilize as much data as possible for training – Computationally expensive
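The k-fold procedure can be sketched as follows (a pure-Python sketch; the round-robin fold assignment is one simple choice among many):

```python
def kfold_splits(n, k):
    """Round-robin partition of indices 0..n-1 into k folds;
    each round uses one fold for testing and the other k-1 for training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, folds[i]))
    return splits

splits = kfold_splits(10, 5)   # 5-fold cross-validation over 10 records
```

Every index appears in exactly one test fold, matching the slide's "exactly once for testing" property.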
Accuracy Can Be Misleading …
• Consider a data set of 99% negative-class and 1% positive-class examples
• A classifier that predicts everything as negative has an accuracy of 99%, though it does not work for the positive class at all!
• Imbalanced class distributions are common in many applications – Medical applications, fraud detection, …
Performance Evaluation Matrix
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Confusion matrix (contingency table, error matrix): useful when the class distribution is imbalanced
Performance Evaluation Matrix
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

True positive rate (TPR, sensitivity) = TP / (TP + FN)
True negative rate (TNR, specificity) = TN / (TN + FP)
False positive rate (FPR) = FP / (TN + FP)
False negative rate (FNR) = FN / (TP + FN)
Recall and Precision
• Target class is more important than the other classes
                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   a (TP)       b (FN)
CLASS     Class=No    c (FP)       d (TN)

Precision p = TP / (TP + FP)
Recall r = TP / (TP + FN)
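All of the rates on these slides follow directly from the four confusion-matrix counts; a small sketch (the counts below are hypothetical):

```python
def rates(tp, fn, fp, tn):
    """Derive the slide's evaluation measures from confusion-matrix counts."""
    return {
        "accuracy":  (tp + tn) / (tp + fn + fp + tn),
        "tpr":       tp / (tp + fn),   # sensitivity / recall
        "tnr":       tn / (tn + fp),   # specificity
        "fpr":       fp / (tn + fp),
        "fnr":       fn / (tp + fn),
        "precision": tp / (tp + fp),
    }

m = rates(tp=40, fn=10, fp=20, tn=930)   # imbalanced toy counts
```

Note that accuracy is 0.97 here even though recall is only 0.80, illustrating the "accuracy can be misleading" slide above.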
Fallout
• Type I errors – false positive: a negative object is classified as positive – Fallout: the type I error rate, FP / (FP + TN)
• Type II errors – false negative: a positive object is classified as negative – Captured by recall
Fβ Measure
• How can we summarize precision and recall into one metric? – Using a (weighted) harmonic mean of the two
• Fβ measure
– β = 0: Fβ is the precision – β → ∞: Fβ approaches the recall – 0 < β < ∞: Fβ trades off the precision and the recall

F-measure (F1): F = 2rp / (r + p) = 2TP / (2TP + FP + FN)

Fβ = (β² + 1)rp / (r + β²p) = (β² + 1)TP / ((β² + 1)TP + β²FN + FP)
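A direct transcription of the Fβ formula, checking the limiting cases noted above (the precision and recall values are made up):

```python
def f_beta(p, r, beta):
    """F_beta = (beta^2 + 1) * r * p / (r + beta^2 * p)."""
    return (beta**2 + 1) * r * p / (r + beta**2 * p)

p, r = 0.5, 0.8
f1 = f_beta(p, r, 1.0)   # harmonic mean of precision and recall
```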
Weighted Accuracy
• A more general metric
Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)

Measure     w1      w2    w3    w4
Recall      1       1     0     0
Precision   1       0     1     0
Fβ          β²+1    β²    1     0
Accuracy    1       1     1     1
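The weight table can be verified mechanically (toy counts; w = (w1, w2, w3, w4) as in the table):

```python
def weighted_accuracy(a, b, c, d, w1, w2, w3, w4):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d), with a=TP, b=FN, c=FP, d=TN."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

a, b, c, d = 40, 10, 20, 930
recall    = weighted_accuracy(a, b, c, d, 1, 1, 0, 0)
precision = weighted_accuracy(a, b, c, d, 1, 0, 1, 0)
f1        = weighted_accuracy(a, b, c, d, 2, 1, 1, 0)   # beta = 1: weights (2, 1, 1, 0)
accuracy  = weighted_accuracy(a, b, c, d, 1, 1, 1, 1)
```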
ROC Curve
• Receiver Operating Characteristic (ROC): consider a 1-dimensional data set containing 2 classes, where any point located at x > t is classified as positive
ROC curve, points written as (TPR, FPR):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal
• Diagonal line: random guessing
• Below the diagonal line: the prediction is the opposite of the true class
Figure from [Tan, Steinbach, Kumar]
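A ROC curve can be traced by sweeping the threshold t over a classifier's scores (the scores and labels below are hypothetical):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per threshold, plus the (0, 0) endpoint."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1,   1,   0,   1,   0,    0]
pts = roc_points(scores, labels)   # ends at (1, 1): everything declared positive
```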
Comparing Two Classifiers
Figure from [Tan, Steinbach, Kumar]
Cost-Sensitive Learning
• In some applications, misclassifying some classes may be disastrous – Tumor detection, fraud detection
• Using a cost matrix

                      PREDICTED CLASS
                      Class=Yes    Class=No
ACTUAL    Class=Yes   −1           100
CLASS     Class=No    1            0
Sampling for Imbalanced Classes
• Consider a data set containing 100 positive examples and 1,000 negative examples
• Undersampling: use a random sample of 100 negative examples and all positive examples – Some useful negative examples may be lost – Run undersampling multiple times, use the ensemble of
multiple base classifiers – Focused undersampling: remove negative samples that
are not useful for classification, e.g., those far away from the decision boundary
Oversampling
• Replicate the positive examples until the training set has an equal number of positive and negative examples
• For noisy data, may cause overfitting
Errors in Classification
• Bias: the difference between the real class boundary and the decision boundary of a classification model
• Variance: variability in the training data set • Intrinsic noise in the target class: the target
class can be non-deterministic – instances with the same attribute values can have different class labels
One or More?
• What if a medical doctor is not sure about a case? – Joint-diagnosis: using a group of doctors carrying
different expertise – The wisdom of the crowd is often more accurate
• All eager learning methods make prediction using a single classifier induced from training data – A single classifier may have low confidence in some
cases • Ensemble methods: construct a set of base
classifiers and take a vote on predictions in classification
Ensemble Classifiers

Step 1: Create multiple data sets D1, D2, …, Dt from the original training data D
Step 2: Build a base classifier Ci on each Di
Step 3: Combine the classifiers: C*(x) = Vote(C1(x), …, Ct(x))

Figure from [Tan, Steinbach, Kumar]
Why May Ensemble Method Work?
• Suppose there are two classes and each base classifier has an error rate of 35%
• What if we use 25 base classifiers? – If all base classifiers are identical, the ensemble
error rate is still 35% – If base classifiers are independent, the
ensemble makes a wrong prediction only if more than half of the base classifiers are wrong
P(ensemble is wrong) = Σ_{i=13}^{25} C(25, i) · 0.35^i · 0.65^{25−i} ≈ 0.06
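The 0.06 figure can be reproduced with a one-line binomial sum (25 independent base classifiers, each with error rate 0.35; the ensemble errs only when at least 13 of them are wrong):

```python
from math import comb

# Probability that a majority (>= 13) of 25 independent base classifiers,
# each with error rate 0.35, are wrong at the same time
e_ensemble = sum(comb(25, i) * 0.35**i * 0.65**(25 - i) for i in range(13, 26))
```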
Ensemble Error Rate
Figure from [Tan, Steinbach, Kumar]
Ensemble Classifiers – When?
• The base classifiers should be independent of each other
• Each base classifier should do better than a classifier that performs random guessing
How to Construct Ensemble?
• Manipulating the training set: derive multiple training sets and build a base classifier on each
• Manipulating the input features: use only a subset of features in a base classifier
• Manipulating the class labels: if there are many classes, randomly partition the classes into two subsets A and B in each base classifier; for a test case, if the base classifier predicts the superclass A, every class in A receives a vote
• Manipulating the learning algorithm, e.g., using different network configurations in an ANN
Bootstrap
• Given an original training set T, derive a training set T’ by repeatedly sampling from T uniformly with replacement
• If T has n tuples, each tuple has a probability p = 1 − (1 − 1/n)^n of being selected into T’ – When n → ∞, p → 1 − 1/e ≈ 0.632
• Use the tuples not in T’ as the test set
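The 0.632 limit is easy to check numerically:

```python
import math

def p_selected(n):
    """Probability that a fixed tuple appears at least once in n draws with replacement."""
    return 1 - (1 - 1 / n) ** n

p = p_selected(10**6)   # close to the limit 1 - 1/e
```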
Bootstrap
• Use a bootstrap sample as the training set, use the tuples not in the training set as the test set
• .632 bootstrap: compute the overall accuracy by combining the accuracies of each bootstrap sample with the accuracy computed from a classifier using the whole data set as the training set
acc_bootstrap = (1/k) Σ_{i=1}^{k} (0.632 · acc_i + 0.368 · acc_all)

where acc_i is the accuracy of the i-th bootstrap round and acc_all is the accuracy of a classifier trained on the whole data set
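A sketch of the .632 combination (the per-round accuracies below are hypothetical):

```python
def acc_632(boot_accs, acc_all):
    """.632 bootstrap: average 0.632*acc_i + 0.368*acc_all over the k rounds."""
    return sum(0.632 * acc_i + 0.368 * acc_all for acc_i in boot_accs) / len(boot_accs)

acc = acc_632([0.80, 0.85, 0.90], acc_all=0.95)
```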
Bagging • Run bootstrap k times to obtain k base classifiers • A test instance is assigned to the class that
receives the highest number of votes • Strength: reduce the variance of base classifiers –
good for unstable base classifiers – Unstable classifiers: sensitive to minor perturbations in
the training set, e.g., decision trees, associative classifiers, and ANN
• For stable classifiers (e.g., linear discriminant analysis and kNN classifiers), bagging may even degrade the performance since the training sets are smaller
• Less overfitting on noisy data
Boosting • Assign a weight to each training example
– Initially, each example is assigned a weight 1/n • Weights can be used in one of the following ways
– Weights as a sampling distribution to draw a set of bootstrap samples from the original training set
– Weights used by a base classifier to learn a model biased towards heavier examples
• Adaptively change the weight at the end of each boosting round – The weight of an example correctly classified decreases – The weight of an example incorrectly classified
increases • Each round generates a base classifier
Critical Design Choices in Boosting
• How are the weights of the training examples updated at the end of each boosting round?
• How are the predictions made by the base classifiers combined?
AdaBoost
• Each base classifier carries an importance score related to its error rate – Error rate
– wi: weight, I(p) = 1 if p is true – Importance score
ε_i = (1/N) Σ_{j=1}^{N} w_j · I(C_i(x_j) ≠ y_j)

α_i = (1/2) ln((1 − ε_i) / ε_i)
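A small numeric check of the two formulas (the weights here are the initial uniform 1/n, and the predictions are made up):

```python
import math

def weighted_error(weights, preds, labels):
    """Weighted error rate: total weight of the misclassified examples
    (weights are kept normalized to sum to 1)."""
    return sum(w for w, p, y in zip(weights, preds, labels) if p != y)

def importance(eps):
    """alpha = 0.5 * ln((1 - eps) / eps); large when the error rate is small."""
    return 0.5 * math.log((1 - eps) / eps)

n = 10
w = [1 / n] * n
preds  = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]   # 2 of 10 examples misclassified
labels = [1] * n
eps = weighted_error(w, preds, labels)
alpha = importance(eps)
```

A classifier no better than a coin flip (ε = 0.5) gets importance 0; the importance grows as ε shrinks.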
How Does Importance Score Work?
Weight Adjustment in AdaBoost
– If any intermediate round generates an error rate of more than 50%, the weights are reset to 1/n
• The ensemble error rate is bounded
w_i^{(j+1)} = (w_i^{(j)} / Z_j) · exp(−α_j)  if C_j(x_i) = y_i
w_i^{(j+1)} = (w_i^{(j)} / Z_j) · exp(+α_j)  if C_j(x_i) ≠ y_i

where Z_j is the normalization factor that makes Σ_i w_i^{(j+1)} = 1

e_ensemble ≤ Π_i √(ε_i (1 − ε_i))
Intuition – Bayesian Classification
• More hockey fans in Canada than in the US – Which country is Tom, a hockey fan, from? – Predicting Canada has a better chance of being right
• Prior probability P(Canadian) = 5%: reflects the background knowledge that 5% of the total population is Canadian
• P(hockey fan | Canadian) = 30%: the probability that a Canadian is a hockey fan
• Posterior probability P(Canadian | hockey fan): the probability that a hockey fan is from Canada
Bayes Theorem
• Find the maximum a posteriori (MAP) hypothesis
– Require background knowledge – Computational cost
P(h | D) = P(D | h) P(h) / P(D)

h_MAP ≡ argmax_{h∈H} P(h | D) = argmax_{h∈H} P(D | h) P(h) / P(D) = argmax_{h∈H} P(D | h) P(h)
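Applying the theorem to the hockey-fan example from the previous slide; note that P(hockey fan | non-Canadian) = 0.10 is an assumed number (not given in the slides), needed only to compute P(hockey fan):

```python
p_can = 0.05                 # prior P(Canadian), from the slide
p_fan_given_can = 0.30       # P(hockey fan | Canadian), from the slide
p_fan_given_not = 0.10       # ASSUMED for illustration only

# Total probability of being a hockey fan, then Bayes' theorem
p_fan = p_fan_given_can * p_can + p_fan_given_not * (1 - p_can)
p_can_given_fan = p_fan_given_can * p_can / p_fan
```

Under these numbers the posterior P(Canadian | hockey fan) ≈ 0.136, already nearly three times the 5% prior.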
Naïve Bayes Classifier
• Assumption: attributes are independent • Given a tuple (a1, a2, …, an), predict its
class as

C = argmax_{Ci} P(a1, a2, …, an | Ci) P(Ci) = argmax_{Ci} P(Ci) Π_j P(aj | Ci)

– argmax_x f(x): the value of x that maximizes f(x)
• Example: argmax_{x ∈ {1, 2, −3}} x² = −3
Example: Training Dataset
Data sample X = (Outlook = Sunny, Temp = Mild, Humid = High, Wind = Weak). Will she play tennis? No

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

P(Yes|X) ∝ P(X|Yes) P(Yes) ≈ 0.014
P(No|X) ∝ P(X|No) P(No) ≈ 0.027, so Naïve Bayes predicts No
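As a check, the scores can be computed by counting directly in the table (a sketch; `score` computes the unnormalized product P(C) Π_j P(aj | C)). Working through the counts gives P(Yes|X) ∝ 0.014 and P(No|X) ∝ 0.027, so the higher score is No:

```python
rows = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]

def score(x, cls):
    """Unnormalized posterior: P(cls) * prod_j P(x_j | cls), estimated by counting."""
    in_cls = [r for r in rows if r[-1] == cls]
    s = len(in_cls) / len(rows)                  # prior P(cls)
    for j, v in enumerate(x):
        s *= sum(1 for r in in_cls if r[j] == v) / len(in_cls)
    return s

x = ("Sunny", "Mild", "High", "Weak")
p_yes, p_no = score(x, "Yes"), score(x, "No")    # No wins
```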
Probability of Infrequent Values
• What about X = (Outlook = Sunny, Temp = Hot, Humid = Low, Wind = Weak)?
• P(Humid = Low) = 0, so the estimated score of every class is 0
Smoothing
• Suppose an attribute has n different values: a1, …, an
• Assume a small enough value ε > 0 • Let Pi be the frequency of ai,
Pi = # tuples having ai / total # of tuples • Estimate
P(ai) = ε + (1 − nε) Pi  (the n smoothed estimates still sum to 1)
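A sketch of the smoothing formula on hypothetical humidity frequencies (note that "Low" was never observed, so its raw frequency is 0):

```python
eps = 0.01
freq = {"High": 0.7, "Normal": 0.3, "Low": 0.0}   # hypothetical raw frequencies P_i
n = len(freq)

# P(a_i) = eps + (1 - n*eps) * P_i : every value gets probability at least eps
smoothed = {a: eps + (1 - n * eps) * p for a, p in freq.items()}
```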
Characteristics of Naïve Bayes
• Robust to isolated noise points – Such points are averaged out in probability
computation • Insensitive to missing values • Robust to irrelevant attributes
– Distributions on such attributes are almost uniform
• Correlated attributes degrade the performance
Bayes Error Rate
• The error rate of the ideal Bayes classifier: the irreducible error due to overlapping classes
Err = ∫_0^x P(Crocodile | X) dX + ∫_x^∞ P(Alligator | X) dX

where x is the decision threshold
Pros and Cons
• Pros – Easy to implement – Good results obtained in many cases
• Cons – A (too) strong assumption: independent
attributes • How to handle dependent/correlated
attributes? – Bayesian belief networks
Associative Classification
• Mine possible association rules (PRs) of the form condset → c – condset: a set of attribute-value pairs – c: a class label
• Build classifier – Organize rules according to decreasing
precedence based on confidence and support • Classification
– Use the first matching rule to classify an unknown case
Associative Classification Methods
• CBA (Classification By Association: Liu, Hsu & Ma, KDD’98) – Mine possible association rules of the form
• cond-set (a set of attribute-value pairs) → class label
– Build classifier: Organize rules according to decreasing precedence based on confidence and then support
• CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01) – Classification: Statistical analysis on multiple rules
Instance-based Methods
• Instance-based learning – Store training examples and delay the processing until a
new instance must be classified (“lazy evaluation”) • Typical approaches
– K-nearest neighbor approach • Instances represented as points in a Euclidean space
– Locally weighted regression • Construct local approximation
– Case-based reasoning • Use symbolic representations and knowledge-based inference
The K-Nearest Neighbor Method
• Instances are points in an n-D space
• The k-nearest neighbors (KNN) under the Euclidean distance – Return the most common value among the k training examples nearest to the query point
• Works for discrete- and real-valued target functions

[Figure: a query point xq surrounded by + and − training examples]
KNN Methods
• For continuous-valued target functions, return the mean value of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm – Give greater weights to closer neighbors
• Robust to noisy data by averaging k-nearest neighbors
• Curse of dimensionality – Distance could be dominated by irrelevant attributes – Axes stretch or elimination of the least relevant attributes
w ≡ 1 / d(xq, xi)²
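A minimal distance-weighted kNN sketch in pure Python (the training points and queries are toy data):

```python
def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def knn_predict(train, x, k=3):
    """Vote among the k nearest neighbors, each weighted by 1 / d(x_q, x_i)^2."""
    nearest = sorted(train, key=lambda item: dist2(item[0], x))[:k]
    votes = {}
    for pt, label in nearest:
        d2 = dist2(pt, x)
        w = 1 / d2 if d2 > 0 else float("inf")   # an exact match dominates the vote
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

train = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"),
         ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+")]
```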
Lazy vs. Eager Learning
• Efficiency: lazy learning uses less training time but more predicting time
• Accuracy – Lazy method effectively uses a richer hypothesis
space – Eager: must commit to a single hypothesis that
covers the entire instance space
Outlier Detection
Motivation: Fraud Detection
Techniques: Fraud Detection
• Features • Dissimilarity • Groups and noise
Outlier Analysis
• “One person’s noise is another person’s signal”
• Outliers: objects considerably dissimilar from the remainder of the data
  – Examples: credit card fraud, Michael Jordan, intrusions, etc.
  – Applications: credit card fraud detection, telecom fraud detection, intrusion detection, customer segmentation, medical analysis, etc.
Outliers and Noise
• Outliers are different from noise
  – Noise is random error or variance in a measured variable
• Outliers are interesting: an outlier violates the mechanism that generates the normal data
• Outlier detection vs. novelty detection
  – At an early stage, novel objects may be regarded as outliers
  – Later, they are merged into the model of normal data
Types of Outliers
• Three kinds: global, contextual, and collective outliers
  – A data set may have multiple types of outliers
  – One object may belong to more than one type of outlier
• Global outlier (or point anomaly)
  – An object that significantly deviates from the rest of the data set
  – Challenge: find an appropriate measurement of deviation
Contextual Outliers
• An outlier object deviates significantly within a selected context
  – Example: is 10°C in Vancouver an outlier? (It depends: summer or winter?)
• Attributes of data objects are divided into two groups
  – Contextual attributes: define the context, e.g., time and location
  – Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?
Collective Outliers
• A subset of data objects that collectively deviates significantly from the whole data set, even if the individual objects are not outliers
  – Application example: intrusion detection, when a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
  – Consider not only the behavior of individual objects, but also that of groups of objects
  – Requires background knowledge about the relationships among data objects, such as a distance or similarity measure
Outlier Detection: Challenges
• Modeling normal objects and outliers properly
  – Hard to enumerate all possible normal behaviors in an application
  – The border between normal objects and outliers is often a gray area
• Application-specific outlier detection
  – The choice of distance measure and the model of relationships among objects are often application-dependent
  – Example: in clinical data a small deviation could be an outlier, while marketing analysis tolerates much larger fluctuations
Outlier Detection: Challenges
• Handling noise in outlier detection
  – Noise may distort the normal objects and blur the distinction between normal objects and outliers
  – Noise may hide outliers and reduce the effectiveness of outlier detection
• Understandability
  – Understand why these objects are outliers: justification of the detection
  – Specify the degree of an outlier: the unlikelihood that the object was generated by a normal mechanism
Outlier Detection Methods
• By whether user-labeled examples of outliers can be obtained
  – Supervised, semi-supervised, and unsupervised methods
• By assumptions about normal data and outliers
  – Statistical, proximity-based, and clustering-based methods
Supervised Methods
• Model outlier detection as a classification problem
  – Samples examined by domain experts are used for training and testing
• Methods for learning a classifier for outlier detection
  – Model normal objects and report those not matching the model as outliers, or
  – Model outliers and treat those not matching the model as normal
• Challenges
  – Imbalanced classes, i.e., outliers are rare: boost the outlier class and make up some artificial outliers
  – Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)
Unsupervised Methods
• Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
• An outlier is expected to be far away from any group of normal objects
• Weakness: cannot detect collective outliers effectively
  – Normal objects may not share any strong patterns, while collective outliers may share high similarity within a small area
• Many clustering methods can be adapted for unsupervised outlier detection
  – Find clusters first; outliers are the objects not belonging to any cluster
Unsupervised Methods: Challenges
• In some intrusion or virus detection settings, normal activities are diverse
  – Unsupervised methods may have a high false-positive rate and still miss many real outliers
  – Supervised methods can be more effective, e.g., at identifying attacks on key resources
• Challenges
  – Hard to distinguish noise from outliers
  – Costly, since clustering comes first, yet there are far fewer outliers than normal objects
• Newer methods tackle outliers directly
Semi-Supervised Methods
• In many applications, the number of labeled objects is small
  – Labels could be on outliers only, normal objects only, or both
• If some labeled normal objects are available
  – Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
  – Objects not fitting the model of normal objects are detected as outliers
• If only some labeled outliers are available, the small number of labeled outliers may not cover the possible outliers well
  – To improve the quality of outlier detection, one can get help from models for normal objects learned by unsupervised methods
Pros and Cons
• The effectiveness of statistical methods highly depends on whether the assumed statistical model holds for the real data
• There are rich alternatives among statistical models
  – Parametric vs. non-parametric
Proximity-based Methods
• An object is an outlier if its nearest neighbors are far away, i.e., the proximity of the object deviates significantly from the proximity of most of the other objects in the same data set
Pros and Cons
• The effectiveness of proximity-based methods relies heavily on the proximity measure
• In some applications, proximity or distance measures cannot be obtained easily
• Often have difficulty identifying a group of outliers that stay close to each other
• Two major types of proximity-based outlier detection methods
  – Distance-based vs. density-based
Clustering-based Methods
• Normal data belong to large and dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to any cluster
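As a sketch of this idea (not from the slides), scikit-learn's DBSCAN labels points that fall in no cluster as noise (label -1), which can serve directly as outlier candidates; the data and parameters below are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)

# Two dense clusters plus one isolated point.
X = np.vstack([
    rng.normal(0, 0.2, size=(40, 2)),
    rng.normal(5, 0.2, size=(40, 2)),
    [[2.5, 8.0]],
])

# DBSCAN assigns the label -1 to points that belong to no cluster.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(np.nonzero(labels == -1)[0])  # index of the isolated point
```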
Challenges
• Since there are many clustering methods, there are many clustering-based outlier detection methods as well
• Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets
Statistical Outlier Analysis
• Assumption: the objects in a data set are generated by a (stochastic) process (a generative model)
• Learn a generative model fitting the given data set, and then identify the objects in low probability regions of the model as outliers
• Two categories: parametric vs. non-parametric
Example
• Statistical methods (also known as model-based methods) assume that the normal data follow some statistical model
  – Data not following the model are outliers
Parametric Methods
• Assumption: the normal data are generated by a parametric distribution with parameter θ
• The probability density function of the parametric distribution, f(x | θ), gives the probability density of object x under the distribution
• The smaller this value, the more likely x is an outlier
Univariate Outliers Based on Normal Distribution
• Taking derivatives with respect to µ and σ2, we derive the following maximum likelihood estimates
$$\ln L(\mu,\sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2$$

$$\hat\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$$
Example
• Daily average temperatures: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
• With n = 10, the maximum likelihood estimates are µ̂ = 28.61 and σ̂ = √2.29 = 1.51
• Since (24 − 28.61)/1.51 = −3.04 < −3, the value 24 is an outlier, because the region µ ± 3σ contains 99.7% of the data under a normal distribution
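The example can be checked with a few lines of NumPy; this is a minimal sketch using the example's data, with variable names of my choosing:

```python
import numpy as np

# Daily average temperatures from the example above.
temps = np.array([24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4])

# Maximum likelihood estimates under a normal model:
mu = temps.mean()       # sample mean, 28.61
sigma = temps.std()     # biased (divide-by-n) standard deviation, the MLE

# z-scores; readings outside mu +/- 3*sigma are flagged as outliers.
z = (temps - mu) / sigma
print(mu, sigma)
print(temps[np.argmax(np.abs(z))])  # the most extreme reading: 24.0
```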
The Grubbs Test
• Also known as the maximum normed residual test
• For each object x in a data set, compute its z-score; x is an outlier if

$$z \ge \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$$

  where $t_{\alpha/(2N),\,N-2}$ is the value taken by a t-distribution with N − 2 degrees of freedom at a significance level of α/(2N), and N is the number of objects in the data set
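As a sketch (not from the slides), the critical value on the right-hand side can be computed with SciPy's t-distribution; for N = 10 and α = 0.05 it is about 2.29, the standard table value:

```python
import math
from scipy.stats import t

def grubbs_critical(n, alpha=0.05):
    """Critical z-score for the two-sided Grubbs test with n observations."""
    # Upper critical value of the t-distribution at significance alpha/(2n),
    # with n - 2 degrees of freedom.
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    return ((n - 1) / math.sqrt(n)) * math.sqrt(
        t_crit**2 / (n - 2 + t_crit**2)
    )

print(grubbs_critical(10))  # ~2.29 for N = 10, alpha = 0.05
```

Any observation whose z-score exceeds this value is rejected as an outlier at the chosen significance level.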
Non-parametric Method
• Do not assume an a priori statistical model; instead, determine the model from the input data
  – Not completely parameter-free, but the number and nature of the parameters are flexible rather than fixed in advance
• Examples: histogram and kernel density estimation
Histogram
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
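A minimal sketch of histogram-based scoring on synthetic transaction amounts (the data, bin count, and threshold are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mostly small transaction amounts, plus one very large transaction.
amounts = np.concatenate([rng.uniform(10, 500, size=999), [7500.0]])

# Build a histogram and score each transaction by the relative frequency
# of its bin; points falling in rare or empty bins get a low score.
counts, edges = np.histogram(amounts, bins=20)
bin_idx = np.clip(np.digitize(amounts, edges) - 1, 0, len(counts) - 1)
freq = counts[bin_idx] / len(amounts)

outliers = amounts[freq < 0.005]  # bins holding less than 0.5% of the data
print(outliers)
```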
Challenges
• Hard to choose an appropriate bin size for the histogram
  – Too small a bin size → normal objects fall in empty or rare bins: false positives
  – Too large a bin size → outliers fall in frequent bins: false negatives
Proximity-based Outlier Detection
• Objects far away from the others are outliers
• The proximity of an outlier deviates significantly from that of most of the others in the data set
• Distance-based outlier detection: an object o is an outlier if its neighborhood does not contain enough other points
• Density-based outlier detection: an object o is an outlier if its density is relatively much lower than that of its neighbors
Jian Pei: CMPT 741/459 Data Mining -- Outlier Detection (2) 395
Depth-based Methods
• Organize data objects in layers of various depths
  – The shallow layers are more likely to contain outliers
• Examples: peeling, depth contours
• Complexity O(N^⌈k/2⌉) for k-dimensional data sets
  – Unacceptable for k > 2
Depth-based Outliers: Example
Distance-based Outliers
• A DB(p, D)-outlier is an object O in a data set T such that at least a fraction p of the objects in T lie at a distance greater than D from O
• The larger D and the larger p, the more outlying the object
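The definition above can be sketched directly in NumPy; the quadratic pairwise-distance approach below is fine for small data sets, and the data are synthetic:

```python
import numpy as np

def db_outliers(X, p, D):
    """Boolean mask of DB(p, D)-outliers over the rows of X.

    An object is a DB(p, D)-outlier if at least a fraction p of all
    objects lie at a distance greater than D from it.
    """
    # Pairwise Euclidean distances between all rows.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Fraction of objects farther than D from each object.
    frac_far = (dist > D).mean(axis=1)
    return frac_far >= p

# A tight cluster around the origin plus one faraway point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), [[10.0, 10.0]]])
print(np.nonzero(db_outliers(X, p=0.9, D=5.0))[0])  # only the last point
```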
Density-based Local Outlier
Both o1 and o2 in the figure are outliers; distance-based methods can detect o1, but not o2
Intuition
• Compare objects to their local neighborhoods instead of the global data distribution
• The density around an outlier object is significantly different from the density around its neighbors
• Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier
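A simplified sketch of this relative-density idea (this is not the full LOF definition, which uses reachability distances; data and parameters are illustrative):

```python
import numpy as np

def relative_density_scores(X, k=3):
    """Simplified LOF-style score: average neighbor density / own density."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)            # exclude self-distances
    knn = np.argsort(dist, axis=1)[:, :k]     # indices of k nearest neighbors
    knn_dist = np.take_along_axis(dist, knn, axis=1)
    density = 1.0 / knn_dist.mean(axis=1)     # local density estimate
    # Score > 1 means the neighbors are denser than the point itself.
    return density[knn].mean(axis=1) / density

# A dense cluster plus a point that is an outlier only relative to it.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, size=(30, 2)), [[1.0, 0.0]]])
scores = relative_density_scores(X, k=3)
print(np.argmax(scores))  # the appended point has the highest score
```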
Classification-based Outlier Detection
• Train a classification model that can distinguish “normal” data from outliers
• A brute-force approach: consider a training set that contains some samples labeled “normal” and others labeled “outlier”
  – A training set in practice is typically heavily biased: the number of “normal” samples likely far exceeds the number of outlier samples
  – Cannot detect unseen anomalies
One-Class Model
• A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using a classification method such as a one-class SVM
• Any sample that does not belong to the normal class (i.e., falls outside the decision boundary) is declared an outlier
• Advantage: can detect new outliers that do not appear close to any outlier objects in the training set
• Extension: normal objects may belong to multiple classes
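A minimal one-class sketch using scikit-learn's OneClassSVM (the training data and parameter values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)

# Training data: only "normal" samples, clustered around the origin.
X_train = rng.normal(0, 1, size=(200, 2))

# nu bounds the fraction of training points treated as boundary violations.
clf = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05).fit(X_train)

# predict() returns +1 for points inside the learned boundary, -1 outside.
X_new = np.array([[0.0, 0.5],   # near the normal cluster
                  [6.0, 6.0]])  # far away, flagged as an outlier
print(clf.predict(X_new))
```

Note that no outlier examples were needed at training time, which is the advantage claimed above.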
Semi-Supervised Learning Methods
• Combine classification-based and clustering-based methods
• Method
  – Use a clustering-based approach to find a large cluster C and a small cluster C1
  – Since some objects in C carry the label “normal”, treat all objects in C as normal
  – Use the one-class model of this cluster to identify normal objects in outlier detection
  – Since some objects in cluster C1 carry the label “outlier”, declare all objects in C1 as outliers
  – Any object that does not fall into the model for C (such as a) is considered an outlier as well
Pros and Cons
• Pros: outlier detection is fast
• Cons: quality heavily depends on the availability and quality of the training set
  – It is often difficult to obtain representative, high-quality training data
Detection of Contextual Outliers
• If the contexts can be clearly identified, transform the problem into conventional outlier detection
  – Identify the context of the object using the contextual attributes
  – Calculate the outlier score for the object within its context using a conventional outlier detection method
Example
• Detect outlier customers in the context of customer groups
  – Contextual attributes: age group, postal code
  – Behavioral attributes: the number of transactions per year, annual total transaction amount
• Method: for a customer c
  – Locate c’s context
  – Compare c with the other customers in the same group
  – Use a conventional outlier detection method
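A minimal sketch of this per-context scoring (the customer records, group labels, and values below are hypothetical, not from the slides):

```python
import numpy as np

# Hypothetical customer records: age group and transactions per year.
groups = np.array(["20s", "20s", "20s", "20s", "60s", "60s", "60s", "60s"])
tx = np.array([200.0, 210.0, 190.0, 205.0, 20.0, 25.0, 22.0, 200.0])

# Score each customer against the others in the SAME context (age group),
# here with a simple z-score as the conventional detection method.
scores = np.empty_like(tx)
for g in np.unique(groups):
    mask = groups == g
    mu, sigma = tx[mask].mean(), tx[mask].std()
    scores[mask] = np.abs(tx[mask] - mu) / sigma

# 200 transactions/year is normal in the 20s group but extreme in the 60s.
print(np.argmax(scores))
```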
Modeling Normal Behavior
• Model the “normal” behavior with respect to contexts
  – Use a training data set to train a model that predicts the expected behavioral attribute values with respect to the contextual attribute values
  – An object is a contextual outlier if its behavioral attribute values significantly deviate from the values predicted by the model
• Use a prediction model to link the contexts and the behavior
  – Avoids explicit identification of specific contexts
  – Possible methods: regression, Markov models, finite state automata, …
Collective Outliers
• Objects that, as a group, deviate significantly from the entire data set
• Examine the structure of the data set, i.e., the relationships between multiple data objects
  – The structures are often not explicitly defined and have to be discovered as part of the outlier detection process
Detecting High Dimensional Outliers
• Interpretability of outliers
  – Which subspaces manifest the outliers, plus an assessment of the “outlying-ness” of the objects
• Data sparsity: data in high-dimensional spaces are often sparse
  – The distance between objects becomes heavily dominated by noise as the dimensionality increases
• Data subspaces
  – Capture the local behavior and patterns of the data
• Scalability with respect to dimensionality
  – The number of subspaces increases exponentially with dimensionality
Angle-based Outliers