Midterm Review

1-Intro

• Data Mining vs. Statistics
  – Predictive vs. experimental; hypothesis-driven vs. data-driven
• Different types of data
• Data Mining pitfalls
  – With lots of data you can find anything
• Data privacy and security
  – Good and bad examples

2- EDA and Visualization

• Good visualization is good analysis
• Examples of vis
  – 1-d, 2-d, multivariate
  – Histograms, boxplots, scatterplots, density estimates, etc.
  – Overplotting with many points
  – Conditional plots (small multiples)
  – Good, bad examples

3- Data mining concepts

• Preparing data for analysis
  – How to deal with missing data?
  – What are good transformations?
  – How to deal with outliers?

• Data reduction
  – Reducing n: sampling, subsetting
  – Reducing p:
    • Principal components: finding projections that preserve variance
      – Scree plot shows how much variance is accounted for by each PC
    • MDS:
      – Needs a distance matrix
      – Minimizes ‘stress function’
      – Mostly used for visualization and EDA
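The numbers behind a scree plot can be sketched in plain Python. This is only an illustration, not the course's method: it finds the eigenvalues of the sample covariance matrix by power iteration with deflation, and the function names (`covariance_matrix`, `top_eigenpair`, `explained_variance_ratios`) are made up for this example. In practice a linear algebra library would do this step.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(mat, iters=500):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix.

    Uses power iteration. Assumption: the start vector is not exactly
    orthogonal to the leading eigenvector.
    """
    p = len(mat)
    v = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, v  # nothing left to explain
        v = [x / norm for x in w]
    # Rayleigh quotient v' M v gives the eigenvalue for the unit vector v.
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v

def explained_variance_ratios(rows):
    """Fraction of total variance captured by each principal component."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))  # trace = total variance
    ratios = []
    for _ in range(p):
        lam, v = top_eigenpair(cov)
        ratios.append(lam / total)
        # Deflate: subtract the component just found before the next search.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return ratios
```

Plotting these ratios against the component index gives the scree plot; a sharp drop after the first few components suggests that p can be reduced to that many projections.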

• In- vs. out-of-sample evaluation
  – In-sample: must penalize for complexity
  – Out-of-sample: use cross-validation to evaluate predictive performance
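Out-of-sample evaluation by k-fold cross-validation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names are hypothetical, the model is a one-predictor least-squares line, and the folds are formed by interleaving indices rather than shuffling.

```python
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        yield train_idx, test_idx

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def cross_val_mse(xs, ys, k=5):
    """Mean squared prediction error on held-out folds only."""
    total, count = 0.0, 0
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        # Fit on the training folds, then score on data the fit never saw.
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        for i in test_idx:
            total += (ys[i] - (slope * xs[i] + intercept)) ** 2
            count += 1
    return total / count
```

Because every point is predicted by a model that did not see it, the resulting error needs no explicit complexity penalty, unlike an in-sample fit statistic.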

3- Data mining concepts

• Complexity/Performance tradeoff
• Evaluating Classification models
  – Accuracy (how many did I get right): not the best choice
  – Precision/recall or Sensitivity/specificity tradeoff
  – Selecting different thresholds for ROC curve
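The evaluation measures above can be computed from the four confusion-matrix counts. The sketch below assumes binary labels coded 0/1 and a classifier that outputs a score, with scores at or above the threshold predicted positive; the function names are illustrative. Sweeping the threshold traces out the ROC curve.

```python
def confusion_counts(labels, scores, threshold):
    """Return (TP, FP, TN, FN) when score >= threshold means 'predict 1'."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(labels, scores, threshold):
    """Accuracy, precision, recall (sensitivity) and specificity."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
    }

def roc_points(labels, scores):
    """(false positive rate, true positive rate) at each distinct threshold.

    Assumes both classes occur in labels.
    """
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_counts(labels, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points
```

One reason accuracy alone can mislead: when one class is rare, always predicting the common class scores high accuracy while recall for the rare class is zero.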

4-Regression

• Linear regression
  – What is it, what are the assumptions, how do you check them?
  – Model selection
    • Exhaustive or Greedy (forward/backward selection) search
• Extensions of Linear regression
  – Non-linear in parameters, linear in form
  – Generalized Linear Models
    • Logistic regression
    • Poisson regression
  – Shrinkage
    • Ridge regression
    • Lasso regression
    • Profile plots show the trace of parameter estimates
  – Principal component regression
  – Nonparametric models
    • Smoothing splines
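Shrinkage is easiest to see in the simplest possible case. The sketch below assumes a single predictor and no intercept, with the penalized objectives written in the docstrings; under those assumptions both estimators have closed forms. The function names are illustrative, and real problems with many predictors need a numerical solver.

```python
import math

def ridge_coef(xs, ys, lam):
    """Minimizer of 0.5*sum((y - b*x)^2) + 0.5*lam*b^2 (one predictor)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_coef(xs, ys, lam):
    """Minimizer of 0.5*sum((y - b*x)^2) + lam*|b| (one predictor).

    The solution soft-thresholds sum(x*y): it is exactly zero once
    lam >= |sum(x*y)|.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx

def coefficient_trace(xs, ys, lams):
    """(lam, ridge estimate, lasso estimate) for each penalty value."""
    return [(lam, ridge_coef(xs, ys, lam), lasso_coef(xs, ys, lam))
            for lam in lams]
```

Plotting the output of `coefficient_trace` against the penalty gives a profile plot: the ridge estimate shrinks smoothly toward zero without reaching it, while the lasso estimate hits exactly zero at a finite penalty.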

5-Classification

• Categorical or binary response – ‘supervised’ learning

• LDA: fit a parametric model to each class
• Classification (decision) trees
  – Binary splits on any predictor X
  – Best split found algorithmically by Gini or entropy to maximize purity
  – Best size can be found via cross-validation
  – Can be unstable
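How a tree picks a split can be sketched for one numeric predictor. This is an illustration under stated assumptions, not a full tree-growing algorithm: candidate thresholds are midpoints between adjacent distinct values, and the split with the lowest size-weighted impurity wins. The function names are made up for this example.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(xs, labels, impurity=gini):
    """Threshold t minimizing weighted impurity of x <= t versus x > t.

    Returns (t, weighted_impurity), or None if xs has one distinct value.
    """
    n = len(xs)
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent distinct values
        left = [y for x, y in zip(xs, labels) if x <= t]
        right = [y for x, y in zip(xs, labels) if x > t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

A full tree repeats this search over every predictor at every node; because each split is chosen greedily, small changes in the data can change early splits, which is one source of the instability noted above.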

• K-Nearest Neighbors
  – Tradeoff of large/small k

• Probabilistic models
  – Bayes error rate: best possible error if model is correct
  – Naïve Bayes
    • Independence assumption on p(xi|c)

6-Clustering

• No response variable – ‘unsupervised’ learning

• Needs distance measures
  – Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  – Select initial solution
  – Classify points, then re-calculate means
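The two K-means steps (classify points, then re-calculate means) can be sketched as a short loop. This is a minimal version assuming squared Euclidean distance and caller-supplied initial means; `k_means` and `squared_distance` are names chosen for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(points, initial_means, max_iters=100):
    """Alternate the two K-means steps until assignments stop changing."""
    means = [list(m) for m in initial_means]
    assignment = None
    for _ in range(max_iters):
        # Classify: each point goes to its nearest current mean.
        new_assignment = [
            min(range(len(means)), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Re-calculate: each mean becomes the centroid of its members.
        for j in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return means, assignment
```

The result depends on the initial solution, so a common practice is to run the loop from several starting points and keep the clustering with the smallest within-cluster sum of squares.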

• Hierarchical clustering
  – Solutions for all k from 1 to n
  – Dendrogram effective visualization
  – Different distance functions (links) will result in different clusterings
• Probabilistic
  – Mixture models fit using EM algorithm
  – Model based clustering
