Midterm Review
1-Intro
• Data Mining vs. Statistics
  – Predictive vs. experimental; hypotheses vs. data-driven
• Different types of data
• Data Mining pitfalls
  – With lots of data you can find anything
• Data privacy and security
  – Good and bad examples
2- EDA and Visualization
• Good visualization is good analysis
• Examples of visualization
  – 1-d, 2-d, multivariate
  – Histograms, boxplots, scatterplots, density estimates, etc.
  – Overplotting with many points
  – Conditional plots (small multiples)
  – Good, bad examples
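As a reminder of what histograms and boxplots actually display, here is a minimal sketch that computes the underlying numbers without any plotting library. The function names and the sample data are made up for illustration, equal-width bins are an assumption, and the quartile rule is just one of several in common use.

```python
import statistics

def histogram_counts(values, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Assumes the values are not all identical (otherwise the bin width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum (the boxplot ingredients)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up sample with one large value
print(histogram_counts(data, 4))
print(five_number_summary(data))
```

The single large value stretches the histogram's range and shows up as a maximum far from the upper quartile, which is the kind of feature these plots are meant to expose.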
3- Data mining concepts
• Preparing data for analysis
  – How to deal with missing data?
  – What are good transformations?
  – How to deal with outliers?
• Data reduction
  – Reducing n: sampling, subsetting
  – Reducing p:
    • Principal components: finding projections that preserve variance
      – Scree plot shows how much variance is accounted for by each PC
    • MDS:
      – Needs a distance matrix
      – Minimizes ‘stress function’
      – Mostly used for visualization and EDA
• In-sample vs. out-of-sample evaluation
  – In-sample: must penalize for complexity
  – Out-of-sample: use cross-validation to evaluate predictive performance
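To make the in-sample vs. out-of-sample distinction concrete, the sketch below fits a one-predictor least-squares line and compares its training error with a k-fold cross-validation estimate. The helper names, the interleaved fold assignment, and the data are assumptions chosen for illustration, not a prescribed procedure.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def mse(model, xs, ys):
    """Mean squared error of a fitted (intercept, slope) pair on the given points."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def k_fold_mse(xs, ys, k):
    """Average held-out MSE over k folds; fold i holds out points i, i+k, i+2k, ..."""
    fold_errors = []
    for i in range(k):
        test_idx = set(range(i, len(xs), k))
        train_x = [x for j, x in enumerate(xs) if j not in test_idx]
        train_y = [y for j, y in enumerate(ys) if j not in test_idx]
        test_x = [xs[j] for j in sorted(test_idx)]
        test_y = [ys[j] for j in sorted(test_idx)]
        # Fit on the training folds only, then score on the held-out fold.
        model = fit_line(train_x, train_y)
        fold_errors.append(mse(model, test_x, test_y))
    return sum(fold_errors) / k

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9, 6.1, 7.2, 7.8, 9.1]  # made-up, roughly y = x

in_sample = mse(fit_line(xs, ys), xs, ys)
out_of_sample = k_fold_mse(xs, ys, k=5)
print(in_sample, out_of_sample)
```

The in-sample figure is optimistic because the same points were used to fit and to score the line; the cross-validated figure scores each point with a model that never saw it.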
3- Data mining concepts (continued)
• Complexity/performance tradeoff
• Evaluating classification models
  – Accuracy (how many did I get right): not the best choice
  – Precision/recall or sensitivity/specificity tradeoff
  – Selecting different thresholds for the ROC curve
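The classification measures above can be sketched from a confusion matrix computed at a chosen score threshold. The function names, the convention that scores at or above the threshold count as positive, and the toy labels and scores are assumptions for illustration.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, tn, fn), predicting positive when score >= threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def summary(labels, scores, threshold):
    """Accuracy, precision, recall (sensitivity) and specificity at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Precision is undefined when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def roc_points(labels, scores, thresholds):
    """One (false positive rate, true positive rate) pair per threshold."""
    points = []
    for t in thresholds:
        tp, fp, tn, fn = confusion_counts(labels, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

labels = [1, 1, 0, 1, 0, 0, 1, 0]                   # made-up true classes
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # made-up classifier scores

print(summary(labels, scores, 0.5))
print(roc_points(labels, scores, [1.0, 0.5, 0.0]))
```

Lowering the threshold moves along the ROC curve from (0, 0) towards (1, 1): more true positives are caught at the cost of more false positives.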
4-Regression
• Linear regression
  – What is it, what are the assumptions, how do you check them?
  – Model selection
    • Exhaustive or greedy (forward/backward selection) search
• Extensions of linear regression
  – Non-linear in the predictors, linear in the parameters
  – Generalized Linear Models
    • Logistic regression
    • Poisson regression
  – Shrinkage
    • Ridge regression
    • Lasso regression
    • Profile plots show the trace of parameter estimates
  – Principal component regression
  – Nonparametric models
    • Smoothing splines
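The shrinkage idea is easiest to see with a single centered predictor, where both penalized slopes have closed forms. This is a sketch under that one-predictor assumption: ridge is taken to minimize the residual sum of squares plus lam * b^2, lasso the residual sum of squares plus lam * |b|, and the function names and data are hypothetical.

```python
def centered_sums(xs, ys):
    """Sxx and Sxy after centering both variables at their means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy

def ridge_slope(xs, ys, lam):
    """Slope minimizing RSS + lam * b**2: shrinks smoothly toward zero."""
    sxx, sxy = centered_sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Slope minimizing RSS + lam * |b|: soft-thresholding, exactly zero for large lam."""
    sxx, sxy = centered_sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

xs = [-2, -1, 0, 1, 2]
ys = [-4.1, -1.9, 0.2, 2.1, 3.7]  # made-up, roughly y = 2x

# A text version of a profile plot: slope estimates traced over the penalty.
for lam in [0.0, 5.0, 10.0, 20.0, 40.0]:
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

At lam = 0 both estimates equal the ordinary least-squares slope. As lam grows, the ridge slope keeps shrinking but never reaches zero, while the lasso slope hits zero and stays there, which is why the lasso also acts as a variable selector.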
5-Classification
• Categorical or binary response – ‘supervised’ learning
• LDA: fit a parametric model to each class
• Classification (decision) trees
  – Binary splits on any predictor X
  – Best split found algorithmically by Gini or entropy to maximize purity
  – Best size can be found via cross-validation
  – Can be unstable
• K-Nearest Neighbors
  – Tradeoff of large/small k
• Probabilistic models
  – Bayes error rate: best possible error if model is correct
  – Naïve Bayes
    • Independence assumption on p(xi|c)
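Two of the classifiers above can be sketched in a few lines for a single numeric predictor: the Gini-based search for the best binary split of a tree, and a k-nearest-neighbor majority vote. The function names and data are made up, candidate thresholds are assumed to be midpoints between adjacent distinct values, and ties in the vote are not handled specially.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, labels):
    """Try midpoints between sorted distinct x values; return (threshold, weighted impurity)."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [c for x, c in zip(xs, labels) if x <= threshold]
        right = [c for x, c in zip(xs, labels) if x > threshold]
        # Weight each child's impurity by its share of the points.
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if best is None or impurity < best[1]:
            best = (threshold, impurity)
    return best

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

xs = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "b"]  # made-up, perfectly separable classes

print(best_split(xs, labels))
print(knn_predict(xs, labels, 2.6, k=3))
print(knn_predict(xs, labels, 4.4, k=3))
```

For k-NN, a small k follows the training data closely and gives a noisy boundary, while a large k averages over more neighbors and gives a smoother but potentially biased one.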
6-Clustering
• No response variable – ‘unsupervised’ learning
• Needs distance measures
  – Euclidean, cosine, Jaccard, edit, ordinal and categorical
• K-means
  – Select initial solution
  – Classify points, then re-calculate means
• Hierarchical clustering
  – Solutions for all k from 1 to n
  – Dendrogram is an effective visualization
  – Different distance functions (links) will result in different clusterings
• Probabilistic
  – Mixture models fit using EM algorithm
  – Model-based clustering
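The two K-means steps listed above (classify points, then re-calculate means) can be sketched as a short loop. This assumes squared Euclidean distance, initial centers supplied by the caller, and made-up 2-d points; the handling of an empty cluster is one arbitrary choice among several.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centers, max_iter=100):
    """Alternate the classify and re-calculate steps until the means stop moving."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Classify: each point goes to its nearest current mean.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Re-calculate: each mean becomes the average of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its previous center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 9.5)]  # made-up data
initial = [(1, 1), (9, 9)]                                          # chosen initial solution

labels, centers = k_means(points, initial)
print(labels)
print(centers)
```

The result depends on the initial solution, so in practice the loop is usually run from several starting points and the best clustering kept.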