Introduction to Data Mining for Newbies

Introduction to Data Mining for Newbies Nov. 2nd, 2012 @echojuliett


Page 1: Introduction to Data Mining for Newbies

Introduction to Data Mining for Newbies

Nov. 2nd, 2012

@echojuliett

Page 2: Introduction to Data Mining for Newbies

Google Datacenter @Douglas County, Georgia “These colorful pipes send and receive water for cooling our facility. Also pictured is a G-Bike, the vehicle of choice for team members to get around outside our data centers.”

Source: http://www.google.com/about/datacenters/gallery/#/tech/10

Page 3: Introduction to Data Mining for Newbies

Eunjeong Lucy Park, PhDs, Data scientist @SNU DMLab. A person who lives on lattes.

Find me at: http://dmlab.snu.ac.kr, http://lucypark.kr


Page 4: Introduction to Data Mining for Newbies

“All scientists are data scientists.” - Monica Rogati, Senior Research Scientist @LinkedIn

Source: http://xkcd.com/242/

Page 5: Introduction to Data Mining for Newbies

“Data is everywhere.”


Cell phone logs

Tweets

Credit card transactions

Manufacturing fault data

Politician data

Social networking data

Web documents

Page 6: Introduction to Data Mining for Newbies

“Data mining is…”

Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.

• “…the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.”

- Berry and Linoff, 1997


Page 7: Introduction to Data Mining for Newbies

“Data mining is…”

• “…the belief in data.”

- @echojuliett, 2012

• Inductive reasoning

Mathematical induction: prove for k=1, assume for k, then prove for k+1

Induction vs. prejudice: # of cases

Ex: What is your hobby?


Page 8: Introduction to Data Mining for Newbies

“Data mining is…”


Page 9: Introduction to Data Mining for Newbies

1. Basic Concepts of Data Mining

2. Origins of Data Mining

3. Data Mining Tools

4. Masters of Data Mining


Page 10: Introduction to Data Mining for Newbies

Data types

Structured data

Unstructured data

Source: http://www.tipforest.com/t/83

Page 11: Introduction to Data Mining for Newbies

(The general) data mining process, for some domain (Marketing, Finance, Manufacturing, etc.)

DATA warehouse → [Selection] → Target data → [Preprocessing] → Preprocessed data → [Data mining] → Patterns → [Interpretation] → KNOWLEDGE

Page 12: Introduction to Data Mining for Newbies

Selection

• Data exploration
  – How many variables?
    • Independent variables, dependent variables, …
    • Continuous variables, categorical variables, …
  – How many records?
  – What distribution?
  – …
• Variable selection & dimensionality reduction
  – Ex: Step-wise selection, PCA (Principal Component Analysis)
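As a rough illustration of dimensionality reduction, here is a minimal PCA sketch that finds only the first principal component by power iteration. The toy records, the helper name `first_principal_component`, and the iteration count are assumptions for illustration; real projects would normally use a numerical library instead.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained variance ratio) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Hypothetical records with two variables that move together (roughly y = 2x).
records = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
direction, explained = first_principal_component(records)
print("direction:", [round(x, 3) for x in direction])
print("share of variance explained:", round(explained, 4))
```

When one component explains almost all of the variance, as in this toy data, the two variables can be replaced by a single derived variable with little loss.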

Page 13: Introduction to Data Mining for Newbies

Preprocessing

• “Partitioning” the data – training data & validation data (& test data …)

Data set → Training data | Validation data
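Partitioning can be sketched in a few lines. The 60/20/20 split, the fixed seed, and the helper name `partition` are assumptions chosen for illustration, not values from the talk.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records, then cut them into training, validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test

# Hypothetical data set of 100 record ids.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters when the records are stored in a meaningful order (for example by date or by class), since an unshuffled split would give the model a biased view of the data.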

Page 14: Introduction to Data Mining for Newbies

Preprocessing

• Beware of “overfitting”

Source: Bishop, PRML, p.7
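A small sketch of what overfitting looks like, loosely in the spirit of the polynomial example in PRML. It assumes 10 noisy samples of a sine curve and uses Lagrange interpolation to build the degree-9 polynomial that passes through every training point; the noise level, seed, and helper names are assumptions.

```python
import math
import random

def lagrange_predict(xs, ts, x):
    """Evaluate the unique degree-(N-1) polynomial through all N training points."""
    total = 0.0
    for i, (xi, ti) in enumerate(zip(xs, ts)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += ti * basis
    return total

def rmse(predict, xs, ts):
    """Root mean squared error of a prediction function on (xs, ts)."""
    return math.sqrt(sum((predict(x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs))

rng = random.Random(0)

def noisy_sine(x):
    # Assumed ground truth: a sine curve plus Gaussian noise.
    return math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)

train_x = [i / 9 for i in range(10)]
train_t = [noisy_sine(x) for x in train_x]
valid_x = [(i + 0.5) / 9 for i in range(9)]  # points between the training points
valid_t = [noisy_sine(x) for x in valid_x]

def model(x):
    return lagrange_predict(train_x, train_t, x)

train_rmse = rmse(model, train_x, train_t)
valid_rmse = rmse(model, valid_x, valid_t)
print(f"training RMSE:   {train_rmse:.3f}")
print(f"validation RMSE: {valid_rmse:.3f}")
```

The model reproduces the training data exactly, so its training error says nothing about how it behaves on unseen data. This is why the validation set from the partitioning step is needed.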

Page 15: Introduction to Data Mining for Newbies

Data mining methods

Predictive methods
• Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
• Regression: An attempt to predict a continuous attribute

Descriptive methods
• Clustering: Finds “natural” grouping of instances given un-labeled data
• Association Rules: Method for discovering interesting relations between variables in large DBs

Page 16: Introduction to Data Mining for Newbies

Regression

• Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …
• Polynomial curve fitting
  • The basic form: min …
  • The advanced form: min …
• Example: Tomorrow’s stock price = f (recent prices, economic indicators, …)
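The two minimization forms are not spelled out here. A minimal sketch, assuming the basic form is ordinary least squares and the advanced form adds an L2 (ridge) penalty on the slope. Both readings are assumptions, as are the toy numbers and the helper `fit_line`.

```python
def fit_line(xs, ts, lam=0.0):
    """Minimize sum((t - (w0 + w1 * x)) ** 2) + lam * w1 ** 2 in closed form.

    lam = 0 gives ordinary least squares; lam > 0 shrinks the slope toward zero.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    t_mean = sum(ts) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    w1 = sxt / (sxx + lam)
    w0 = t_mean - w1 * x_mean  # the intercept is not penalized
    return w0, w1

# Hypothetical observations of an independent variable x and a continuous target t.
xs = [1, 2, 3, 4, 5]
ts = [2.2, 4.1, 6.1, 7.9, 10.2]

w0, w1 = fit_line(xs, ts)                  # assumed "basic form"
w0_reg, w1_reg = fit_line(xs, ts, lam=10)  # assumed "advanced form"
print(f"least squares: t = {w0:.2f} + {w1:.2f} x")
print(f"with penalty:  t = {w0_reg:.2f} + {w1_reg:.2f} x")
```

The penalty trades a slightly worse fit on the training data for a less extreme model, which is one common way to counter overfitting.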

Page 17: Introduction to Data Mining for Newbies

Classification

• Regression with a categorical dependent variable
• Naïve Bayes classification, decision trees, ANNs, SVMs, …
• Ex: E-mail spam detection (spam or inbox?)
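The spam example can be sketched with a tiny Naïve Bayes classifier. The four training messages are made up, and the word-count model with add-one smoothing is one common variant, assumed here for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_messages):
    """Count messages and words per class; input is a list of (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_messages:
        class_counts[label] += 1
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total_messages)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical pre-labeled training data.
training = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "inbox"),
    ("lunch on monday with the team", "inbox"),
]
model = train_naive_bayes(training)
print(classify("win a cheap prize", model))
print(classify("monday team meeting", model))
```

The "naïve" part is the assumption that words occur independently given the class, which is why the per-word log likelihoods are simply added.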

Page 18: Introduction to Data Mining for Newbies

Clustering

• Grouping of similar objects

• Unsupervised, Exploratory Knowledge Discovery

• k-means, hierarchical clustering, SOM, …

• Ex: Politician segmentation

[Figure: Jaccard Similarity based Hierarchical Clustering Dendrogram (D9). Axis ticks from 0.1 to 0.8. Legend: Grand National Party (conservative), Democratic United Party (liberal), Others.]
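The idea behind a Jaccard-based hierarchical clustering can be sketched on toy data. The politicians "A" to "E", their bill sets, and the choice of average linkage are all assumptions; the linkage used for the dendrogram is not stated.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

def agglomerate(items, n_clusters):
    """Average-linkage agglomerative clustering on Jaccard distance (1 - similarity)."""
    clusters = [[name] for name in items]

    def distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(1 - jaccard(items[x], items[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical politicians and the bills each one supported.
supported_bills = {
    "A": {"bill1", "bill2", "bill3"},
    "B": {"bill1", "bill2"},
    "C": {"bill2", "bill3"},
    "D": {"bill7", "bill8"},
    "E": {"bill7", "bill8", "bill9"},
}
clusters = agglomerate(supported_bills, n_clusters=2)
print(clusters)
```

Recording the distance at each merge, instead of stopping at a fixed number of clusters, would give the heights needed to draw a dendrogram.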

Page 19: Introduction to Data Mining for Newbies

Association Rules

Source: http://lucypark.tistory.com/48
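Association rules are usually scored with support, confidence, and lift. A minimal sketch on made-up shopping baskets (the transactions and item names are assumptions for illustration):

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """How much more often the two occur together than if they were independent."""
    return confidence(antecedent, consequent) / support(consequent)

rule = ({"diapers"}, {"beer"})
rule_support = support(rule[0] | rule[1])
rule_confidence = confidence(*rule)
rule_lift = lift(*rule)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift above 1 suggests the items appear together more often than chance would predict, which is what makes a rule "interesting".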

Page 20: Introduction to Data Mining for Newbies

Data mining methods

Predictive methods
• Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
• Regression: An attempt to predict a continuous attribute

Descriptive methods
• Clustering: Finds “natural” grouping of instances given un-labeled data
• Association Rules: Method for discovering interesting relations between variables in large DBs

Page 21: Introduction to Data Mining for Newbies


Pop quiz!

Page 22: Introduction to Data Mining for Newbies


Pop quiz!

Page 23: Introduction to Data Mining for Newbies


Pop quiz!

Page 24: Introduction to Data Mining for Newbies


Pop quiz!

Page 27: Introduction to Data Mining for Newbies


Pop quiz!

Page 28: Introduction to Data Mining for Newbies

1. Basic Concepts of Data Mining

2. Origins of Data Mining

3. Data Mining Tools

4. Masters of Data Mining


Page 29: Introduction to Data Mining for Newbies

Historical Note

Data Fishing, Data Dredging: 1960-

• used by statisticians (as a pejorative)

Knowledge Discovery in Databases (KDD): 1989-

• used by Artificial Intelligence (AI), Machine Learning (ML) communities

Data Mining, Data Analytics: 1990-

• used in DB communities, business

Big data: 2000-

Page 30: Introduction to Data Mining for Newbies

Comparisons

• Data mining

• Statistics

• Machine learning

• Pattern recognition

• …

Page 31: Introduction to Data Mining for Newbies

1. Basic Concepts of Data Mining

2. Origins of Data Mining

3. Data Mining Tools

4. Masters of Data Mining


Page 33: Introduction to Data Mining for Newbies

SAS Enterprise Miner (“E-miner”)

Page 34: Introduction to Data Mining for Newbies

XLMiner

• 15-day trial version available at http://www.solver.com/xlminer-data-mining
• Useful for prototyping
• Supports:
  • Preprocessing
    • Data partitioning
    • Missing data imputation
    • Categorical data transformation
    • PCA (Principal Component Analysis)
  • Algorithms
    • Multiple linear regression
    • k-NN (k nearest neighbors)
    • CART (classification and regression trees)
    • ANN (artificial neural networks)
    • Discriminant analysis
    • Logistic regression
    • Naïve Bayes classification
    • Association rules
    • k-means clustering
    • Hierarchical clustering

Page 35: Introduction to Data Mining for Newbies

More…

• Mathworks MATLAB / GNU Octave
  – Most data mining algorithms are preinstalled
  – Relatively easy to learn
• General purpose programming languages
  – For example, C, Java, Python, etc.
  – Packages such as Orange (http://orange.biolab.si/) for Python are available
  – May be a better fit for tasks like natural language processing
• Even more…
  – Try visiting http://www.kdnuggets.com/software/suites.html

Page 36: Introduction to Data Mining for Newbies

1. Basic Concepts of Data Mining

2. Origins of Data Mining

3. Data Mining Tools

4. Masters of Data Mining


Page 37: Introduction to Data Mining for Newbies

• Mitchell (Carnegie Mellon University)

• Vapnik (NEC Labs)

• Bishop (Microsoft Cambridge)

• Smola (Yahoo, Australian National University)

• Ng (Stanford University)

Foreign warriors

Page 38: Introduction to Data Mining for Newbies

• 조성준 (Seoul National University)

• 조재희 (Kwangwoon University)

• 조성배 (Yonsei University)

• 이성임 (Dankook University)

• 김성범 (Korea University)

Domestic warriors

Page 39: Introduction to Data Mining for Newbies

References

• [1] Duda, Hart, Stork, Pattern Classification, 2nd ed., Wiley, 2001.

• [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.

• [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.

Page 40: Introduction to Data Mining for Newbies

Any Questions?
