Statistics 202: Data Mining - Week 1 - Stanford...
Statistics 202: Data Mining, Week 1
© Jonathan Taylor
Based in part on slides from the textbook and slides of Susan Holmes
October 7, 2011
Part I
Introduction
Data Mining
What is data mining?
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.
A key feature of data mining is that the data sets are larger than those encountered in “classical” statistics: so large that the analysis must be (semi-)automated.
Data Mining
Who uses data mining?
Industry:
1. Netflix
2. Amazon
3. Google (e.g., Google Trends)
Science:
1. Genomics
2. Climate Science
3. Astrophysics
4. Neuroimaging
Netflix
Amazon
[Screenshot: Amazon.com product page for Introduction to Data Mining (Tan, Steinbach, Kumar), showing personalized recommendations (“Hello, Jonathan Taylor. We have recommendations for you.”) and a “Customers Who Bought This Item Also Bought” list featuring data mining books by Witten, Hastie, Segaran, and Han.]
Google Trends
[Screenshot: Google Trends results for the query “andrew luck”, showing search volume over time, related news headlines, top regions (United States, Canada, Australia, United Kingdom), and top cities, led by Stanford, CA.]
Genomics
Neuroimaging
Climate science
Data Mining
Some things that are not data mining
Looking up a record in a database by an identifier such as last name. (No pattern is revealed by this lookup . . . )
Searching for “Amazon” on Google. (Google has done some data mining, but you have not . . . )
Testing a two-sample hypothesis in a clinical trial. (The data set is often not large and unstructured.)
Data Mining
Some things that are more like data mining
Noting that some last names occur in certain geographical areas.
Taking all query results from Google on “Amazon” and discovering that there are at least two groups: “Amazon river” and “Amazon.com”.
When doing multiple tests across many different genes, identifying very strongly significant genes . . .
Data Mining
Prediction / Supervised Problems
In such problems there is an outcome or label we want to predict based on many features.
Classification
Regression
Outlier detection
Data Mining
Descriptive / Unsupervised Problems
In such problems, we are seeking to discover hidden “structure” in the data, without an outcome or label.
Clustering
Dimension Reduction
Association Rules
Semisupervised problems
A mix of labelled and unlabelled data is used.
Part II
Examples
A canonical example
Voting records: House of Representatives
On http://clerk.house.gov, one can view all roll call votes for many years.
We might hypothesize that there is some structure to these votes: the votes should cluster roughly by party.
How can we find this information out from http://clerk.house.gov?
Voting records
Scraping the data
Any practical problem like this requires taking the data off the web.
This can be a large part of the time spent in such a data mining task.
Example: 2010, Vote # 134
Voting records
Scraping the data
There were 664 roll call votes in the House of Representatives during 2010.
At any given time, there can be up to 435 members in the House of Representatives.
In an idealized setting, this means there exists a matrix

    X_{435 × 664}

of votes.
What are the entries of the matrix? They could be strings: “Aye”/“Nay”, or numbers (e.g. “Aye” → 1, “Nay” → −1, “Not Voting” → 0).
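As a concrete sketch of this encoding (using a few made-up votes rather than the scraped House data, and written in Python rather than the course's R), the string-to-number mapping from the slide could be applied like so:

```python
import numpy as np

# Mapping follows the slide: "Aye" -> 1, "Nay" -> -1, "Not Voting" -> 0.
VOTE_CODE = {"Aye": 1, "Nay": -1, "Not Voting": 0}

raw_votes = [                       # rows = members, columns = roll call votes
    ["Aye", "Nay", "Aye"],
    ["Nay", "Nay", "Not Voting"],
    ["Aye", "Aye", "Aye"],
]

# Build the numeric vote matrix X.
X = np.array([[VOTE_CODE[v] for v in row] for row in raw_votes])
print(X.tolist())  # [[1, -1, 1], [-1, -1, 0], [1, 1, 1]]
```

In the idealized setting of the slide, `raw_votes` would have 435 rows and 664 columns.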
Voting records
In practice
Not all members voted on all bills. Some members resign (e.g. Anthony Weiner, etc.)
In the data set I scraped, 427 members had a recorded vote for all bills.
Voting records
Forming a “model”
Our hypothesis is that from each row X_i of X we might be able to decide the party label of the i-th representative.
This rule must be expressed as a function f(X_i).
What rules to use?
A function (or rule, if you prefer) that is constant for Democrats and Republicans cannot be useful.
An example of such a bad rule is the number of recorded votes for each person in the dataset: this is just 664.
Voting records
Forming rules
Given we have our data matrix X, we can form many rules by taking linear combinations across the {−1, 0, 1} values.
Each β ∈ R^664 determines a linear rule

    f_β(x) = x^T β

But we already decided that constant rules are uninteresting. So we want rules that vary a lot across the data.
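In code, applying a linear rule to every member at once is a single matrix-vector product. A sketch with toy numbers (the matrix and β below are invented for illustration):

```python
import numpy as np

# Toy data: 3 members, 4 encoded votes, and an arbitrary illustrative beta.
X = np.array([[ 1, -1,  1,  0],
              [-1,  1, -1,  1],
              [ 1,  1,  0, -1]])
beta = np.array([0.5, -0.5, 0.5, 0.0])

# f_beta(X_i) = X_i^T beta for every member i at once.
scores = X @ beta
print(scores.tolist())  # [1.5, -1.5, 0.0]
```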
Voting records
Principal Component Analysis
Consider the following abstract problem
    maximize_β “Variability”(f_β)

For our linear rules, if you scale β, the variability scales. So we should constrain ‖β‖ = 1.
With this constraint, and “Variability” being sample variance, we can actually solve this problem.
The rule with the maximum sample variance is the leading eigenvector of

    X^T H X

where H is the mean-removing matrix (let’s not worry about exactly what it is at the moment).
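A minimal numerical check of this eigenvector characterization, on random toy data rather than the voting matrix: centering the columns is the effect of H, so X^T H X is just the cross-product of the centered matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))       # toy data matrix: n = 50 cases, p = 5 features

# Centering the columns is what the mean-removing matrix H does,
# so X^T H X equals Xc^T Xc for the centered matrix Xc.
Xc = X - X.mean(axis=0)

# Leading eigenvector = direction of maximum sample variance.
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
beta1 = eigvecs[:, -1]                 # eigh sorts eigenvalues in ascending order

def sample_var(b):
    """Sample variance of the linear rule f_b over the data."""
    return np.var(Xc @ b, ddof=1)

# Any other unit-norm beta gives no larger a sample variance.
b_rand = rng.standard_normal(5)
b_rand /= np.linalg.norm(b_rand)
print(sample_var(beta1) >= sample_var(b_rand))  # True
```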
PCA on a sample of 20 votes
PCA on a sample of 20 votes, labelled
PCA on a sample of 100 votes, labelled
PCA on another sample of 20 votes, labelled
PCA on another sample of 20 votes, labelled
Screeplot
Voting records
Predicting party from a rule
Now we have this new rule f_β̂1, where β̂1 was the most “variable” among the linear rules.
It seems that if we just guess party based on the sign of this rule, we will do pretty well . . .
Voting records
Classification
This is an example of a (linear) classifier: a rule c that takes a set of votes x and assigns a label, R or D.
We can express this thresholding rule for a given β in terms of

    sign_R(β) = sign( mean_{i ∈ sample(R)} f_β(X_i) )

The rule is now

    c_β(x) = R if sign(f_β(x)) = sign_R(β), and D otherwise.
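A sketch of this thresholding classifier on hypothetical data (the vote matrix, labels, and β below are invented for illustration, not taken from the scraped data set):

```python
import numpy as np

# Toy data: 4 members, 3 encoded votes, invented labels and beta.
X = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [-1, -1,  1],
              [-1,  1,  1]])
labels = np.array(["R", "R", "D", "D"])
beta = np.array([1.0, 0.0, -1.0])

scores = X @ beta
# sign_R(beta): the sign of the mean score over the members labelled R.
sign_R = np.sign(scores[labels == "R"].mean())

# c_beta(x) = R when sign(f_beta(x)) matches sign_R(beta), D otherwise.
predictions = np.where(np.sign(scores) == sign_R, "R", "D")
print(predictions.tolist())  # ['R', 'R', 'D', 'D']
```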
Voting records
Classification accuracy
Given a label y, we can ask how accurate the rule is:

    L(x, y, c) = 1 if c(x) ≠ y, and 0 otherwise.

Not surprisingly, when we apply this rule to our sample of 427 members, we do pretty well.
That is, if Y is the vector of {D, R} labels, the following quantity is pretty small:

    Σ_{i=1}^{427} L(X_i, Y_i, c_β̂1)
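The summed 0-1 loss is just a count of disagreements between predicted and true labels. A toy version (invented labels):

```python
import numpy as np

# Summed 0-1 loss = number of members the rule misclassifies.
predicted = np.array(["R", "R", "D", "D", "R"])
truth     = np.array(["R", "D", "D", "D", "R"])

misclassified = int((predicted != truth).sum())
error_rate = misclassified / len(truth)
print(misclassified, error_rate)  # 1 0.2
```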
Voting records
Validation
However, we used the information on all 427 members to find β̂1. Could this be cheating?
We can validate this procedure of feature extraction followed by classification by finding β̂1 based on only, say, 400 members.
Having found β̂1 on the data from these 400 members, we can then predict the party of the remaining 27 members.
See http://stats202.stanford.edu/voting.html.
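The split itself can be sketched as a random 400/27 partition of member indices (the seed below is arbitrary):

```python
import numpy as np

# Hold out 27 of the 427 members; beta-hat would be fit on the other 400.
rng = np.random.default_rng(202)       # arbitrary seed
indices = rng.permutation(427)
train, test = indices[:400], indices[400:]

print(len(train), len(test))  # 400 27
```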
Voting records
Summary
Preprocessing: Scraping the data off the web. This can be time consuming!
Feature extraction: We derived interesting linear rules by looking at the votes X, but not the labels Y. This is an unsupervised task.
Classification: Having found interesting linear rules, we used them to predict the labels Y. This is a supervised task.
Validation: In order to validate the feature extraction / classification steps together, we extracted features based on a random subset of 400 members and predicted the labels of the remaining 27. This is possible in supervised problems, but generally not in unsupervised problems.
Part III
Types of data
Data
What is data?
Data comes in many forms.
The simplest one to deal with is a “flat file”, or data matrix / data frame:

    X_{n × p}

Usually, n is the number of “cases” and p is the number of “features”.
Example: if we record height, weight, and GPA for 400 Stanford undergraduates, we would represent this as X_{400 × 3}.
Data
Continuous variables
Our previous example had each feature being numeric. Not all data is numeric.
If we add major to our data set, then we have a categorical or discrete variable.
Many categorical variables are unordered, but some are ordered.
For example, if we followed our 400 undergrads 10 years out into their careers, we might make annual income brackets:

    (50K, 100K], (100K, 150K], (150K, 200K], etc.
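Binning a numeric variable into such ordered brackets can be sketched with `numpy.digitize` (the incomes below are hypothetical):

```python
import numpy as np

# Hypothetical incomes binned into the ordered brackets from the slide.
incomes = np.array([72_000, 120_000, 180_000])
edges = [50_000, 100_000, 150_000, 200_000]
labels = ["(50K,100K]", "(100K,150K]", "(150K,200K]"]

# right=True makes each bin half-open on the left: edges[i-1] < x <= edges[i].
brackets = [labels[np.digitize(x, edges, right=True) - 1] for x in incomes]
print(brackets)  # ['(50K,100K]', '(100K,150K]', '(150K,200K]']
```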
Example from textbook
Data
Types of data / attributes
Nominal: Examples from text: ID numbers, eye color, zip codes
Ordinal: Examples from text: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval: Examples from text: calendar dates, temperatures in Celsius or Fahrenheit
Ratio: Examples from text: temperature in Kelvin, length, time, counts
Data
Properties of data / attributes
Distinctness: =, ≠
Order: <, >
Addition: +, −
Multiplication: *, /
Data
Discrete vs. continuous
Discrete Attribute: Has only a finite or countably infinite set of values. Examples from text: zip codes, counts, the set of words in a collection of documents, binary data.
Continuous Attribute: Has real numbers for values. Examples from text: temperature, height, or weight. Floating-point representation in the computer.
Data
2010 U.S. Congress example
Let’s look at the variables we used in our previous example.
party: discrete with two values {D, R}
vote1, vote2, . . . : discrete with values {Yay, Nay, Aye, Present, Not Voting, No}
numeric vote1, numeric vote2, . . . : discrete with values {−1, 0, 1}
pca.votes$scores: continuous
This transformation from the discrete variables vote1, vote2, . . . to the continuous variables pca.votes$scores is a recurring one in applied statistics . . . Let’s take a look at R.