Introduction to Data Mining for Newbies
Nov. 2nd, 2012
@echojuliett
Google Datacenter @Douglas County, Georgia “These colorful pipes send and receive water for cooling our facility. Also pictured is a G-Bike, the vehicle of choice for team members to get around outside our data centers.”
Source: http://www.google.com/about/datacenters/gallery/#/tech/10
Eunjeong Lucy Park PhDs, Data scientist @SNU DMLab
A person who lives on lattes.
Find me at: http://dmlab.snu.ac.kr, http://lucypark.kr
“All scientists are data scientists.” - Monica Rogati, Senior Research Scientist @LinkedIn
Source: http://xkcd.com/242/
“Data is everywhere.”
Cell phone logs
Tweets
Credit card transactions
Manufacturing fault data
Politician data
Social networking data
Web documents
“Data mining is…”
Source: Berry and Linoff, Data Mining Techniques: For Marketing, Sales, and Customer Support, New York: Wiley, 1997.
• “…the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.”
- Berry and Linoff, 1997
“Data mining is…”
• “…the belief in data.”
- @echojuliett, 2012
• Inductive reasoning
Mathematical induction: prove for k=1, assume for k, then prove for k+1
Induction vs. prejudice: the difference is the number of cases observed
Ex: What is your hobby?
1. Basic Concepts of Data Mining
2. Origins of Data Mining
3. Data Mining Tools
4. Masters of Data Mining
Data types
Structured data
Unstructured data
Source: http://www.tipforest.com/t/83
(The general) data mining process
Data warehouse of some domain (marketing, finance, manufacturing, etc.)
→ Selection → Target data
→ Preprocessing → Preprocessed data
→ Data mining → Patterns
→ Interpretation → Knowledge
• Data exploration
– How many variables?
• Independent variables, dependent variables, …
• Continuous variables, categorical variables, …
– How many records?
– What distribution?
– …
• Variable selection & dimensionality reduction
– Ex: Step-wise selection, PCA (Principal Component Analysis)
Selection
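The data exploration questions above (how many variables, how many records, what distribution) can be sketched in a few lines of Python. This is only an illustrative sketch: the `records` list, its variable names, and the `explore` helper are hypothetical and not part of the original slides.

```python
import statistics
from collections import Counter

# Hypothetical toy records: each dict is one record, each key one variable.
records = [
    {"age": 23, "income": 31.0, "region": "north", "churned": "no"},
    {"age": 35, "income": 52.5, "region": "south", "churned": "no"},
    {"age": 41, "income": 48.0, "region": "north", "churned": "yes"},
    {"age": 29, "income": 39.5, "region": "east", "churned": "no"},
    {"age": 52, "income": 75.0, "region": "south", "churned": "yes"},
]

def explore(rows):
    """Summarize each variable: its kind and a rough view of its distribution."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Continuous variable: location and spread.
            summary[name] = {
                "kind": "continuous",
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values),
                "min": min(values),
                "max": max(values),
            }
        else:
            # Categorical variable: frequency of each level.
            summary[name] = {
                "kind": "categorical",
                "counts": dict(Counter(values)),
            }
    return summary

summary = explore(records)
n_records = len(records)
n_variables = len(summary)
print(n_records, "records,", n_variables, "variables")
for name, info in summary.items():
    print(name, info)
```

Here `churned` would play the role of the dependent variable and the others the independent variables; that assignment is also an assumption made for the example.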
Data set → Training data + Validation data
• “Partitioning” the data – training data & validation data (& test data …)
Preprocessing
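Partitioning can be sketched as a shuffled split. The 60/20/20 proportions, the fixed seed, and the `partition` helper name are assumptions chosen for illustration; the slides do not prescribe any of them.

```python
import random

def partition(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the rows, then cut them into training, validation and test sets.

    Whatever is left after the training and validation cuts becomes test data.
    """
    rows = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_train = int(len(rows) * train_frac)
    n_valid = int(len(rows) * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]
    return train, valid, test

data = list(range(100))                 # stand-in for 100 records
train, valid, test = partition(data)
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters when the records are stored in some meaningful order (by date, by customer, and so on), since an unshuffled cut would give the three sets different distributions.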
Preprocessing
• Beware of “overfitting”
Source: Bishop, PRML, p.7
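Overfitting shows up as a gap between training error and validation error. The figure cited from Bishop uses polynomial curve fitting; the sketch below swaps in k-NN regression instead, because it needs no linear algebra, and the noisy sine data, noise level, and values of k are all assumptions made for the example.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples of sin(2*pi*x) on [0, 1] (a hypothetical toy data set)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(t for _, t in nearest) / k

def mse(train, data, k):
    """Mean squared error of a k-NN model built on `train`, measured on `data`."""
    return sum((knn_predict(train, x, k) - t) ** 2 for x, t in data) / len(data)

rng = random.Random(0)
train = make_data(30, rng)
valid = make_data(30, rng)

# Columns: k, training error, validation error.
for k in (1, 5, 15):
    print(k, round(mse(train, train, k), 3), round(mse(train, valid, k), 3))
```

With k = 1 every training point is its own nearest neighbor, so the training error is zero while the validation error is not. A model that simply memorizes the training data looks perfect on it, which is why a separate validation set is needed.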
Data mining methods
Predictive methods
• Classification: Learns a method for predicting the instance class from pre-labeled (classified) instances
• Regression: An attempt to predict a continuous attribute
Descriptive methods
• Clustering: Finds “natural” grouping of instances given un-labeled data
• Association Rules: Method for discovering interesting relations between variables in large DBs
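Association rules are not given a slide of their own, so here is a minimal sketch of the two usual measures, support and confidence, on a tiny set of market-basket transactions. The transactions, the thresholds, and the helper names are hypothetical, and the brute-force search over item pairs only stands in for a real algorithm such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

def rules(min_support=0.4, min_confidence=0.7):
    """Brute-force search for single-item rules lhs -> rhs above both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support({lhs, rhs})
            if s >= min_support:
                c = confidence({lhs}, {rhs})
                if c >= min_confidence:
                    found.append((lhs, rhs, s, c))
    return found

for lhs, rhs, s, c in rules():
    print(f"{lhs} -> {rhs}: support={s:.2f}, confidence={c:.2f}")
```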
Regression
• Linear regression, k-nearest neighbors (k-NN), artificial neural networks (ANN), …
• Polynomial curve fitting
• The basic form: [equation: min …]
• The advanced form: [equation: min …]
• Example: Tomorrow’s stock price = f(recent prices, economic indicators, …)
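The formulas on this slide did not survive in the transcript, so the sketch below is not a reconstruction of them. It shows ordinary least squares for a straight line, the simplest case of minimizing squared error, applied to the stock price example. The price series is made up, and using only the previous day's price as the input is an assumption.

```python
def fit_line(xs, ts):
    """Least-squares fit of t = w0 + w1 * x (minimizes the sum of squared errors)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    w1 = sxt / sxx
    w0 = mean_t - w1 * mean_x
    return w0, w1

# Hypothetical closing prices, oldest first.
prices = [100.0, 101.5, 103.0, 102.0, 104.5, 106.0, 105.5, 107.0]

# Independent variable: today's price. Dependent variable: tomorrow's price.
xs = prices[:-1]
ts = prices[1:]

w0, w1 = fit_line(xs, ts)
forecast = w0 + w1 * prices[-1]
print("w0 =", round(w0, 3), "w1 =", round(w1, 3), "forecast =", round(forecast, 2))
```

A real forecast would use more inputs (several recent prices, economic indicators) and would be checked on validation data, as in the partitioning step.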
Classification
• Regression with a categorical dependent variable
• Naïve Bayes classification, decision trees, ANNs, SVMs, …
• Ex: E-mail spam detection
[Figure: an incoming e-mail (“?”) is to be classified as spam or inbox]
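The spam example can be sketched with a small Naïve Bayes classifier over word counts. The six training messages, the `train` and `classify` helpers, and the use of add-one (Laplace) smoothing are assumptions made for illustration; a real filter would need far more data and better tokenization.

```python
import math
from collections import Counter

# Hypothetical pre-labeled training messages.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("win a free prize now", "spam"),
    ("meeting agenda for monday", "inbox"),
    ("lunch on monday", "inbox"),
    ("project meeting notes", "inbox"),
]

def train(examples):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

model = train(training)
print(classify("win free money", model))
print(classify("monday meeting", model))
```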
Clustering
• Grouping of similar objects
• Unsupervised, Exploratory Knowledge Discovery
• k-means, hierarchical clustering, SOM, …
• Ex: Politician segmentation
[Figure: Jaccard Similarity based Hierarchical Clustering Dendrogram (D9). Legend: Grand National Party (conservative), Democratic United Party (liberal), Others]
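A dendrogram like the one in the politician example comes from agglomerative (bottom-up) hierarchical clustering. The sketch below uses Jaccard distance with average linkage on made-up data: the politicians "A" to "E", the bill identifiers, and the choice of average linkage are all assumptions, since the transcript does not say which features or linkage the original analysis used.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    return len(a & b) / len(a | b)

def agglomerate(items, n_clusters):
    """Average-linkage agglomerative clustering on Jaccard distance (1 - similarity).

    `items` maps a name to a set; merging stops when `n_clusters` remain.
    """
    clusters = [[name] for name in items]

    def distance(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(1 - jaccard(items[x], items[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: which bills each politician co-sponsored.
bills = {
    "A": {"b1", "b2", "b3"},
    "B": {"b1", "b2", "b4"},
    "C": {"b2", "b3"},
    "D": {"b7", "b8", "b9"},
    "E": {"b7", "b8"},
}

clusters = agglomerate(bills, 2)
print(clusters)
```

Recording the distance at which each merge happens, instead of stopping at a fixed number of clusters, gives the heights that a dendrogram plots.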
Pop quiz!
Source: http://www.cis.hut.fi/research/som-research/worldmap.html
Source: http://popupcity.net/2009/04/why-are-that-many-logos-blue/
2. Origins of Data Mining
Historical Note
Data Fishing, Data Dredging: 1960-
• used by statisticians (as a pejorative term)
Knowledge Discovery in Databases (KDD): 1989-
• used by Artificial Intelligence (AI), Machine Learning (ML) communities
Data Mining, Data Analytics: 1990-
• used in DB communities, business
Big data: 2000-
Comparisons
• Data mining
• Statistics
• Machine learning
• Pattern recognition
• …
3. Data Mining Tools
R
Source: http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html
SAS Enterprise Miner (“E-miner”)
XLMiner
• 15-day trial version available at http://www.solver.com/xlminer-data-mining
• Useful for prototyping
• Supports:
– Preprocessing
• Data partitioning
• Missing data imputation
• Categorical data transformation
• PCA (Principal Component Analysis)
– Algorithms
• Multiple linear regression
• k-NN (k nearest neighbors)
• CART (classification and regression trees)
• ANN (artificial neural networks)
• Discriminant analysis
• Logistic regression
• Naïve Bayes classification
• Association rules
• k-means clustering
• Hierarchical clustering
More…
• Mathworks MATLAB / GNU Octave
– Most DM algorithms are preinstalled
– Relatively easy to learn
• General purpose programming languages
– For example, C, Java, Python, etc.
– Packages such as Orange (http://orange.biolab.si/) for Python are available
– May be better suited to tasks like natural language processing
• Even more…
– Try visiting http://www.kdnuggets.com/software/suites.html
4. Masters of Data Mining
• Mitchell (Carnegie Mellon University)
• Vapnik (NEC Labs)
• Bishop (Microsoft Cambridge)
• Smola (Yahoo, Australian National University)
• Ng (Stanford University)
Foreign warriors
• 조성준 (Seoul National University)
• 조재희 (Kwangwoon University)
• 조성배 (Yonsei University)
• 이성임 (Dankook University)
• 김성범 (Korea University)
Domestic warriors
• [1] Duda, Hart, Stork, Pattern Classification 2nd ed., Wiley, 2001.
• [2] Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006.
• [3] Shmueli, Patel, Bruce, Data Mining for Business Intelligence, 2nd ed., Wiley, 2010.
References
Any Questions?