Introduction to predictive modeling and data mining
Rebecca C. Steorts
Predictive Modeling and Data Mining: STA 521
August 25, 2015
Today’s Menu
1. Brief history of data science (from slides of Bin Yu)
2. Motivation of this course.
3. What is predictive modeling and data science?
4. The boring bits (but they’re important).
Data science
[Bin Yu, IMS Presidential Address, 2014]
More about data science and how it’s relevant today....
Data science, today
[Credit: Jenny Bryan]
So what is the class all about?
What is data mining?
Data mining is the science of discovering structure and making predictions in (large) data sets
- Unsupervised learning: discovering structure
  E.g., given measurements X1, ..., Xn, learn some underlying group structure based on similarity
- Supervised learning: making predictions
  I.e., given measurements (X1, Y1), ..., (Xn, Yn), learn a model to predict Yi from Xi
Note: Hidden underneath is the idea of prediction (which we will get into). As we have discussed, terms like data science, data mining, etc. just sound sexier!
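The two settings above can be sketched on simulated data. This is only a minimal illustration in Python (the course itself uses R); the function names `two_means_1d` and `fit_line`, the simulated data, and all parameter values are assumptions made up for this example, not part of the course material. The unsupervised half sees only X and recovers two groups; the supervised half sees pairs (X, Y) and learns a rule for predicting Y from X.

```python
import random

def two_means_1d(xs, iters=20):
    """Unsupervised: split 1-D points into two groups (k-means with k = 2).

    Assumes the points are not all identical, so neither group is empty.
    """
    c0, c1 = min(xs), max(xs)  # start the two centers at the extremes
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

def fit_line(xs, ys):
    """Supervised: least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    return ybar - b * xbar, b

rng = random.Random(0)  # fixed seed so the run is repeatable

# Unsupervised: only X is observed; two hidden groups centered at 0 and 5.
X = [rng.gauss(0, 0.5) for _ in range(50)] + [rng.gauss(5, 0.5) for _ in range(50)]
centers = two_means_1d(X)

# Supervised: pairs (X_i, Y_i) simulated from Y = 1 + 2 X + noise.
Xs = [i / 10 for i in range(50)]
Ys = [1 + 2 * x + rng.gauss(0, 0.2) for x in Xs]
a, b = fit_line(Xs, Ys)

def predict(x):
    """Predict Y at a new X using the fitted line."""
    return a + b * x

print("estimated group centers:", centers)
print("fitted intercept and slope:", (a, b))
print("prediction at x = 6:", predict(6.0))
```

The estimated centers should land near the simulated group means, and the fitted line near the simulated intercept and slope; real data rarely behaves this cleanly.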
Search Ads
Gmail Chrome
People you may know
Netflix
$1M prize!
eHarmony
Falling in love with statistics
FICO
An algorithm that could cause a lot of grief
FlightCaster
Apparently it’s even used by airlines themselves
IBM’s Watson
A combination of many things, including data mining
Handwritten postal codes
(From ESL p. 404)
We could have robot mailmen someday
Subtypes of breast cancer
Subtypes of breast cancer based on wound response
Predicting Alzheimer’s disease
(From Raji et al. (2009), “Age, Alzheimer’s disease, and brain structure”)
Can we predict Alzheimer’s disease years in advance?
Kaggle 2015 Challenge
Competition to find interesting information in Census data and maps.
What to expect
Expect to be able to deal with messy data and to write code that is reproducible
Why? Real applied problems are messy (and for others to understand how you attacked something, it’s important that your method, process, and code be accessible, well documented, and reproducible)
- You can’t always open up R, download a package, and get a reasonable answer
- Real data is messy and always presents new complications
- Understanding why and how things work is a necessary precursor to figuring out what to do
Recurring themes
Exact approach versus approximation: often when we can’t do something exactly, we’ll settle for an approximation. An approximation can perform well, and it often scales well computationally to large problems
Bias-variance tradeoff: nearly every modeling decision is a tradeoff between bias and variance. Higher model complexity typically means lower bias and higher variance
Interpretability versus predictive performance: there is also usually a tradeoff between a model that is interpretable and one that predicts well under general circumstances
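The bias-variance theme can be made concrete with a small simulation. The sketch below is an illustration in Python (the course itself uses R) under made-up assumptions: the "true" function sin(2πx), the noise level, the evaluation point x0 = 0.25, and the helper names `knn_predict` and `bias2_and_variance` are all hypothetical. It uses k-nearest-neighbor averaging, where a small k plays the role of a complex, flexible model and a large k the role of a simple one, and estimates squared bias and variance of the prediction at x0 by refitting on many simulated training sets.

```python
import math
import random

def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def bias2_and_variance(k, n=30, x0=0.25, sigma=0.3, reps=500, seed=0):
    """Monte Carlo estimate of (squared bias, variance) of the k-NN fit at x0."""
    rng = random.Random(seed)
    f = lambda x: math.sin(2 * math.pi * x)  # hypothetical "true" regression function
    xs = [i / (n - 1) for i in range(n)]     # fixed design points on [0, 1]
    preds = []
    for _ in range(reps):
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]  # fresh noisy training set
        preds.append(knn_predict(xs, ys, x0, k))
    mean_pred = sum(preds) / reps
    variance = sum((p - mean_pred) ** 2 for p in preds) / reps
    return (mean_pred - f(x0)) ** 2, variance

# k = 1 is the most flexible fit; k = 30 uses every point (the overall mean).
results = {k: bias2_and_variance(k) for k in (1, 5, 30)}
for k, (b2, var) in results.items():
    print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

In this setup the flexible fit (k = 1) should show small bias and large variance, while the rigid fit (k = 30) shows the reverse; intermediate values of k trade one for the other.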
There’s not a universal recipe book
Unfortunately, there’s no universal recipe book for when and in what situations you should apply certain data mining methods
Statistics doesn’t work like that. Sometimes there’s a clear approach; sometimes there is a good amount of uncertainty in what route should be taken. That’s what makes it so hard, and so fun
This is true even at the expert level (and there are even larger philosophical disagreements spanning whole classes of problems)
The best you can do is try to understand the problem, understand the proposed methods and what assumptions they are making, and find some way to evaluate their performances
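One common way to evaluate competing methods is to compare their prediction error on data held out from fitting. The sketch below is an illustration in Python (the course itself uses R), not a prescribed procedure: it runs 5-fold cross-validation to compare two hypothetical candidates, a constant predictor (`fit_mean`) and a least-squares line (`fit_line`), on simulated data. The function names, the data-generating model, and the number of folds are all assumptions for this example.

```python
import random

def fit_mean(xs, ys):
    """Candidate 1: ignore x and always predict the training mean of y."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_line(xs, ys):
    """Candidate 2: least-squares line y = a + b * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    return lambda x: a + b * x

def cv_mse(fit, xs, ys, folds=5, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)      # random assignment of points to folds
    total = 0.0
    for f in range(folds):
        test = idx[f::folds]              # every folds-th shuffled index is held out
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
    return total / len(xs)

rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [1 + 2 * x + rng.gauss(0, 0.5) for x in xs]  # simulated linear trend plus noise

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
print("CV error, constant predictor:", mse_mean)
print("CV error, fitted line:", mse_line)
```

Because each model is scored only on points it did not see during fitting, the comparison reflects predictive performance rather than how closely each method can follow the training data.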
Hopefully you’re still awake
What do I need to know about the course?
Course staff:
- Instructor: Rebecca Steorts (you can call me “Beka” or “Professor Steorts”, please not “Professor”)
- TAs: Abbas Zaidi and Yikun (Joey) Zhou
Why are you here?
- Because you love the subject, because it’s required, because you eventually want to make $$$ ...
- No matter the reason, everyone can get something out of the course
- Work hard and have fun!
Culture of the class
- Teaching you to fish (versus giving you one).
  - It’s amazing what a determined individual can learn from documentation, small learning examples, and ... “gasp” Googling. And also Stack Overflow.
- Rewarding engagement, intellectual generosity and curiosity.
  - Speaking up, sharing success OR failure, showing some interest in something will earn marks.
- Zero tolerance of plagiarism!
  - Generating your own ideas, your own code, and finding your own way is a big reason you’re here. The process is much more important than simply getting to the end point or product.
Logistics/Grading
- Two lectures a week: concepts, methods, examples
- Lab to try stuff out and get fast feedback (10%)
- Participation, creativity in new ideas, sharing in your successes and failures, etc. (5%)
- HW weekly to do longer and more complex things (35%)
- Mid-term exam (25%)
- Final project in groups of 2-3, will be fun! (25%)
Prerequisites:
- Assuming you know basic probability and statistics, linear algebra, R programming (see syllabus for topics list)
Textbooks:
- Course textbook: Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. Get it online at http://www-bcf.usc.edu/~gareth/ISL.
- Bayesian Essentials with R by Marin and Robert. Please order this one (we won’t need it for about two weeks).
- More advanced textbook: Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. Also available online at http://www-stat.stanford.edu/ElemStatLearn
Markdown, RStudio, and LaTeX
You must type all assignments and take-home exams in Markdown (we will talk about submissions in class). All code must be written in RStudio.
https://guides.github.com/features/mastering-markdown/
https://rstudio-pubs-static.s3.amazonaws.com/18858_0c289c260a574ea08c0f10b944abc883.html
Turning in Your Own Work
- You may work together on homework. In fact, you should. You’ll learn a great deal.
- But... all code, write-ups, etc. must be your own work and not shared.
- All write-ups must be your own work and not shared or copied in any manner.
- All take-home exams or projects that are not collaborative are to be your own work. You may not work together.
- If I find out that any assignment is not your own, you will receive a 0 on the assignment and you will be reported as per the university’s policy on cheating and plagiarism.
- You will submit all work online, which Abbas will explain.
Setting up for success
1. R or RStudio
2. Intro to RStudio and Markdown: first lab.
3. git and bitbucket
Quick intro to git
Download git and bitbucket.
1. git init
2. git add file.txt
3. git log
4. git status
5. git commit -a -m "here are my changes."
6. git push
You will understand more complicated git commands in your lab. (Branching, Merging, Uploading a file, etc.)
Who am I
- Assistant prof and affiliated faculty at SSRI and iiD.
- Specialize in record linkage and dimension reduction methods and algorithms for applications in human rights conflicts, official statistics, social networks, medical databases, and many others.
- The methods I work on focus on Bayesian methods, machine learning, and scalable algorithms (intensive computing).
- PhD in 2012 from University of Florida and finished Visiting Assistant Professorship at CMU in 2015.
- First semester at Duke and second time teaching the course, but changing many things! Very excited to be here.
Next time: RStudio, Markdown, and git.