CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper...

42
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong

Transcript of CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper...

Page 1: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

CS570 Introduction to Data Mining

Department of Mathematics and Computer Science

Li Xiong

Page 2: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

2

Today

Meeting everybody in class

Course topics

Course logistics

Page 3: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Instructor

Li Xiong Web: http://www.mathcs.emory.edu/~lxiong Email: [email protected] Office Hours: MW 11:15-12:15pm or by appt Office: MSC E412

Page 4: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

About Mehttp://www.mathcs.emory.edu/~lxiong

• Undergraduate teaching– CS170 Intro to CS I– CS171 Intro to CS II– CS377 Database systems– CS378 Data mining

• Graduate teaching– CS550 Database systems– CS570 Data mining– CS573 Data privacy and security– CS730R/CS584 Topics in data management – big data analytics

• Research– Data privacy and security– Spatiotemporal data management– health informatics

• Industry experience (software engineer)– SRA International – IBM internet security systems

4

Page 5: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

TA

• TA: Farnaz Tahmasebian– Email: [email protected]– Office Hours: TBA– Office: N414

5

Page 6: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Meet everyone in class

Group introduction (2-3 people) Introducing your group

Name Goals for taking the course Something interesting to share with the class

8/23/2017

Page 7: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

7

Today

Meeting everybody in class

Course topics

Course logistics

Page 8: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

8

Evolution of Sciences

Before 1600, empirical science

Knowledge must be based on observable phenomena

Natural science vs. social sciences

1600-1950s, theoretical science

Motivate experiments and generalize our understanding (e.g. theoretical physics)

1950s-now, computational science

Traditionally meant simulation (e.g. computational physics)

Evolving to include information management

1960-now, data science

Flood of data from new scientific instruments and simulations

Ability to economically store and manage petabytes of data online

Accessibility of the data through the Internet and computing Grid

Scientific information management poses Computer Science challenges: acquisition,

organization, query, analysis and visualization of the data

Jim Gray and Alex Szalay, The World Wide Telescope, Comm. ACM, 45(11): 50-54, Nov.

2002

Page 9: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 9

Evolution of Data and Information Science

1960s:

Data collection, database creation, network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models (extended-relational, OO, deductive, etc.)

Application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s:

Data mining, data warehousing, multimedia databases, and Web

databases

2000s

Stream data management and mining

Data mining and its applications

Web technology (XML, data integration) and global information systems

Social networks

Page 10: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Big Data Tsunami

Page 11: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

The 5 V’s of Big Data

Page 12: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Transforming the world with data

Precision medicine Enriched daily lives and social systems

Page 13: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Value of Data

Precision medicine

Page 14: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Value of Data

GPS traces, call records Syndromic surveillance, social relationships

Page 15: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Value of Data

Shopping history Recommendations

Page 16: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

What the class is about

Page 17: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 17

What Is Data Mining?

Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown and

potentially useful) patterns or knowledge from huge amount of data

Data mining really means knowledge mining

We are drowning in data, but starving for knowledge!

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge

extraction, data/pattern analysis, data archeology, information

harvesting, business intelligence, etc.

Page 18: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 18

Knowledge Discovery (KDD) Process

Data Cleaning

Data Integration

Databases

Data Warehouse

Task-relevant Data

Selection and

transformation

Data Mining

Pattern Evaluation

Page 19: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 19

Data Mining: Confluence of Multiple Disciplines

Data Mining

Database Technology

Statistics

MachineLearning

OtherDisciplines

Visualization

Artificial Intelligence

Page 20: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 20

Data Mining Functionalities

Predictive: predict the value of a particular attribute based on the values of other attributes Classification Regression

Descriptive: derive patterns that summarize the underlying relationships in data Pattern mining and association analysis Cluster analysis Ranking queries and skyline

Page 21: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Class Topics

Classical data mining and machine learning algorithms Frequent pattern mining Classification Clustering

Data exploration techniques Ranking (kNN, skyline)

Data mining applications and emerging challenges Spatiotemporal data mining (data variety) Truth discovery (data veracity) Privacy preserving data mining (data privacy)

21

Page 22: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Classification and prediction

Data Mining: Concepts and Techniques 22

Classification: construct models (functions) that describe

and distinguish classes for future prediction

Prediction/regression: predict unknown or missing

numerical values

Derived models can be represented as rules, mathematical

formulas, etc.

Topics

Classification: Decision tree, Bayesian classification,

Neural networks, Support vector machines, kNN

Regression: linear and non-linear regression

Ensemble methods

Page 23: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 23

Frequent pattern mining and association analysis

Frequent pattern: a pattern (a set of items, subsequences,

substructures, etc.) that occurs frequently in a data set

Frequent sequential pattern

Frequent structured pattern

Applications

Basket data analysis — Beer and diapers

Web log (click stream) analysis

DNA sequence analysis

Challenge: efficient algorithms to handle exponential size of

the search space

Topics

Algorithms: Apriori, Frequent pattern growth, Vertical format

Closed and maximal patterns

Association rules mining

23Data Mining: Concepts and Techniques

Page 24: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 24

Cluster and outlier analysis

Cluster analysis

Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns

Unsupervised learning (vs. supervised learning)

Maximizing intra-class similarity & minimizing interclass similarity

Outlier analysis

Outlier: Data object that does not comply with the general behavior of the data

Noise or exception? Useful in fraud detection, rare events analysis

E.g. Extreme large purchase

Page 25: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Clustering Analysis

Topics Partitioning based clustering: k-means Hierarchical clustering: classical, BIRCH Density based clustering: DBSCAN Model-based clustering: EM Cluster evaluation Outlier analysis

25Data Mining: Concepts and Techniques

Page 26: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Ranking queries and skyline

Data Mining: Concepts and Techniques 26

Topk and kNN queries

Skyline

Algorithms and various definitions

Page 27: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Spatiotemporal data mining

Trajectory mining Time series Applications

Mobility study Traffic prediction Location recommendation

Data Mining: Concepts and Techniques 27

Page 28: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Big Data and Privacy

Page 29: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Privacy Risks

Page 30: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Privacy Risks

Tracking Identification Profiling

Page 31: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Privacy preserving data mining

Topics: algorithms that allow data mining while preserving individual information

Challenge: tradeoff between privacy, accuracy, and efficiency

Data Mining: Concepts and Techniques 32

Page 32: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 33

Today

Meeting everybody in class

Course topics

Course logistics

Page 33: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Textbooks

Data mining: concepts and techniques. J. Han, M. Kamber, Jian Pei. 3rd edition

Mining of massive datasets. J. Leskovec, A. Rajaraman, J. Ullman Online version: http://www.mmds.org

G. James, D. Witten, T. Hastie, R. Tibshirani, An

Introductio to Statistical Learning, 2013

P.-N. Tan, M. Steinbach and V. Kumar, Introduction to

Data Mining, Wiley, 2005

Data Mining: Concepts and Techniques 34

Page 34: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data (Mining) Conferences

Data mining SIGKDD, ICDM, SDM, CIKM, PAKDD …

Data management SIGMOD, VLDB, ICDE, EDBT, CIKM …

Machine learning ICML, NIPS, AAAI, …

Data Mining: Concepts and Techniques 35

Page 35: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Workload

~3 programming assignments Implementation of classical algorithms and competition!

~3 reading assignments/paper reviews ~1 paper presentation 1 course project (team of up to 3 students) 1 midterm No final exam

36Data Mining: Concepts and Techniques

Page 36: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Paper reviews

1 page NOT just a summary of the paper, but your

critical opinion of the paper Format

Summary 3 strengths or things you like (S1, S2, S3 …) 3 weaknesses (W1, W2, W3 …) Potential extensions/ideas

Connect and contrast the paper to what we have learned/read so far

8/23/201737

Page 37: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Course Project

Options Comparative study and evaluation of existing

algorithms Design of new algorithms to solve new problems Data mining challenges

Timeline 10/16: proposal 11/27, 11/29, 12/4: Project workshop/presentation 12/16: project report/deliverables

Page 38: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Late Policy

Late assignment will be accepted within 3 days of the due date and penalized 10% per day

2 late assignment allowances, each can be used to turn in a single late assignment within 3 days of the due date without penalty.

Page 39: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Learning Objectives (Non technical)

Read papers and write paper critiques Present papers and lead discussions Learn/practice the life cycle of a research project

literature review problem formulation project proposal writing algorithm design experimental studies paper/project report writing

8/23/2017

Page 40: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Grading

Assignments/presentations 40% Final project 30% Midterm 30%

Page 41: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Some expectations

Participate in class, think critically, ask questions Read and write reviews critically Start on assignments and projects early Enjoy the class!

8/23/2017

Page 42: CS570 Introduction to Data Mining - Emory Universitylxiong/cs570/share/slides/01_intro.pdfPaper reviews 1 page NOT just a summary of the paper, but your critical opinion of the paper

Data Mining: Concepts and Techniques 43

Today

Meeting everybody in class

Course topics

Course logistics

Next lecture: data preprocessing