CS145: INTRODUCTION TO DATA MINING ...
Embed Size (px)
Transcript of CS145: INTRODUCTION TO DATA MINING ...

CS145: INTRODUCTION TO DATA MINING
Instructor: Yizhou [email protected]
October 8, 2018
2: Vector Data: Prediction
mailto:[email protected] 
TA Office Hour Time Change Junheng Hao: Tuesday 13pmYunsheng Bai: Thursday 13pm
2

Methods to Learn
3
Vector Data Set Data Sequence Data Text Data
Classification Logistic Regression; Decision Tree; KNNSVM; NN
Nave Bayes for Text
Clustering Kmeans; hierarchicalclustering; DBSCAN; Mixture Models
PLSA
Prediction Linear RegressionGLM*
Frequent Pattern Mining
Apriori; FP growth GSP; PrefixSpan
Similarity Search DTW

How to learn these algorithms?Three levels
When it is applicable? Input, output, strengths, weaknesses, time
complexity
How it works? Pseudocode, work flows, major steps Can work out a toy problem by pen and paper
Why it works? Intuition, philosophy, objective, derivation, proof
4

Vector Data: Prediction
Vector Data
Linear Regression Model
Model Evaluation and Selection
Summary
5

Example
6
npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11xA matrix of n :
n data objects / points p attributes / dimensions

Attribute TypeNumerical
E.g., height, income
Categorical / discrete E.g., Sex, Race
7

Categorical Attribute Types Nominal: categories, states, or names of things
Hair_color = {auburn, black, blond, brown, grey, red, white} marital status, occupation, ID numbers, zip codes
Binary Nominal attribute with only 2 states (0 and 1) Symmetric binary: both outcomes equally important
e.g., gender Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative) Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal Values have a meaningful order (ranking) but magnitude between
successive values is not known. Size = {small, medium, large}, grades, army rankings
8

Basic Statistical Descriptions of Data
Central Tendency Dispersion of the Data Graphic Displays
9

Measuring the Central Tendency Mean (algebraic measure) (sample vs. population):
Note: n is sample size and N is population size.
Weighted arithmetic mean:
Trimmed mean: chopping extreme values
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula:
Nx=
=
=n
iixn
x1
1
=
== n
ii
n
iii
w
xwx
1
1
)(3 medianmeanmodemean =
10

Symmetric vs. Skewed Data Median, mean and mode of
symmetric, positively and negatively skewed data
positively skewed negatively skewed
symmetric
11

Measuring the Dispersion of Data Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Interquartile range: IQR = Q3 Q1 Five number summary: min, Q1, median, Q3, max
Outlier: usually, a value higher/lower than 1.5 x IQR of Q3 or Q1
Variance and standard deviation (sample: s, population: )
Variance: (algebraic, scalable computation)
2 = [ 2] = 2 2
Standard deviation s (or ) is the square root of variance s2 (or2)
= ==
=
=n
i
n
iii
n
ii xn
xn
xxn
s1 1
22
1
22 ])(1[1
1)(1
1
12
Box Plot

Graphic Displays of Basic Statistical Descriptions
Histogram: xaxis are values, yaxis repres. frequencies
Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
13

Histogram Analysis
Histogram: Graph display of tabulated frequencies, shown as bars
It shows what proportion of cases fall into each of several categories
Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts, a crucial distinction when the categories are not of uniform width
The categories are usually specified as nonoverlapping intervals of some variable. The categories (bars) must be adjacent
0
5
10
15
20
25
30
35
40
10000 30000 50000 70000 9000014

Scatter plot Provides a first look at bivariate data to see clusters of points,
outliers, etc Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
15

Positively and Negatively Correlated Data
The left half fragment is positively
correlated
The right half is negative correlated
16

Uncorrelated Data
17

Scatterplot Matrices
Matrix of scatterplots (xydiagrams) of the kdim. data [total of 2 + unique scatterplots]
18
Use
d by
erm
issi
on o
f M. W
ard,
Wor
cest
er P
olyt
echn
icIn
stitu
te

Vector Data: Prediction
Vector Data
Linear Regression Model
Model Evaluation and Selection
Summary
19

Linear RegressionOrdinary Least Square Regression
Closed form solution Gradient descent
Linear Regression with Probabilistic Interpretation
20

The Linear Regression ProblemAny Attributes to Continuous Value: x y
{age; major ; gender; race} GPA
{income; credit score; profession} loan
{college; major ; GPA} future income ...
21

Example of House PriceLiving Area (sqft) # of Beds Price (1000$)
2104 3 400
1600 3 330
2400 3 369
1416 2 232
3000 4 540
22
x=(1, 2) y
= 0 + 11 + 22

Illustration
23

Formalization Data: n independent data objects
, = 1, ,
= 1, 2, , T, i = 1, ,
A constant factor is added to model the bias term, i. e. , 0 = 1
New x: = 0, 1, 2, , T
Model: : dependent variable : explanatory variables = 0,1, ,
: = = 0 + 11 + 22 + +
24

A 3step ProcessModel Construction
Use training data to find the best parameter , denoted as
Model Selection Use validation data to select the best model
E.g., Feature selection
Model Usage Apply the model to the unseen data (test data): =
25

Least Square EstimationCost function (Mean Square Error):
= 12
2 /
Matrix form: = X ( )/2
or X 2/2
26: + matrix
npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x
,1
,1
,1 1
y:

Ordinary Least Squares (OLS) Goal: find that minimizes
= 12
X
= 12
( + )
Ordinary least squares Set first derivative of as 0
= ( )/ = 0
= 1
More about matrix calculus: https://atmos.washington.edu/~dennis/MatrixCalculus.pdf
27
Z

Gradient DescentMinimize the cost function by moving down in the steepest direction
28

Batch Gradient Descent Move in the direction of steepest descend
Repeat until converge {
}
29
(+1):=(t)
=(t) ,
Where = 12
2/ = ()/ and
=
/ =
/
e.g., = 0.01

Stochastic Gradient Descent When a new observation, i, comes in, update weight immediately (extremely useful for largescale datasets):
Repeat {for i=1:n {
(+1):=(t) + ( ())}
}
30
If the prediction for object i is smaller than the real value, should move forward to the direction of

Probabilistic Interpretation Review of normal distribution
X~ ,2 = = 122
2
22
31

Probabilistic Interpretation Model: = +
~(0,2)  ,~(,2)
= Likelihood:
= ,)
= 1
22exp{
2
22}
Maximum Likelihood Estimation find that maximizes L arg max = arg min , Equivalent to OLS!
32

Other Practical Issues Handle different scales of numerical attributes
Zscore: =
x: raw score to be standardized, : mean of the population, : standard deviation
What if some attributes are nominal? Set dummy variables
E.g., = 1, = ; = 0, = Nominal variable with multiple values?
Create more dummy variables for one variable
What if some attribute are ordinal? replace xif by their rank map the range of each variable onto [0, 1] by replacing ith object in
the fth variable by =11
33
},...,1{ fif Mr
Type equation here.

Other Practical IssuesWhat if is not invertible?
Add a small portion of identity matrix, , to it ridge regression or linear regression with l2 norm
What if nonlinear correlation exists? Transform features, say, to 2
34

Vector Data: Prediction
Vector Data
Linear Regression Model
Model Evaluation and Selection
Summary
35

Model Selection Problem Basic problem:
how to choose between competing linear regression models
Model too simple: underfit the data; poor predictions; high bias; low
variance Model too complex:
overfit the data; poor predictions; low bias; high variance
Model just right: balance bias and variance to get good predictions
36

Bias and Variance Bias: ( ) ()
How far away is the expectation of the estimator to the true value? The smaller the better.
Variance: = [ 2
] How variant is the estimator? The smaller the better.
Reconsider mean square error
/ = 2
/ Can be considered as
[ () 2
] = 2 + +
37Note = 0, = 2
True predictor : Estimated predictor :

BiasVariance Tradeoff
38

Example: degree d in regression
39
http://www.holehouse.org/mlclass/10_Advice_for_applying_machine_learning.html

Example: regularization term in regression
40

CrossValidationPartition the data into K folds
Use K1 fold as training, and 1 fold as testing Calculate the average accuracy best on K trainingtesting pairs Accuracy on validation/test dataset!
Mean square error can again be used: 2
/
41

AIC & BIC*AIC and BIC can be used to test the quality of statistical models AIC (Akaike information criterion)
= 2 2ln(), where k is the number of parameters in the model
and is the likelihood under the estimated parameter
BIC (Bayesian Information criterion) B = () 2ln(), Where n is the number of objects
42

Stepwise Feature Selection Avoid bruteforce selection
2
Forward selection Starting with the best single feature Always add the feature that improves the performance
best Stop if no feature will further improve the performance
Backward elimination Start with the full model Always remove the feature that results in the best
performance enhancement Stop if removing any feature will get worse performance
43

Vector Data: Prediction
Vector Data
Linear Regression Model
Model Evaluation and Selection
Summary
44

Summary What is vector data?
Attribute types Basic statistics Visualization
Linear regression OLS Probabilistic interpretation
Model Evaluation and Selection BiasVariance Tradeoff Mean square error Crossvalidation, AIC, BIC, stepwise feature selection
45
CS145: Introduction to Data MiningTA Office Hour Time ChangeMethods to LearnHow to learn these algorithms?Vector Data: PredictionExampleAttribute TypeCategorical Attribute Types Basic Statistical Descriptions of DataMeasuring the Central Tendency Symmetric vs. Skewed DataMeasuring the Dispersion of DataGraphic Displays of Basic Statistical DescriptionsHistogram AnalysisScatter plotPositively and Negatively Correlated Data Uncorrelated DataScatterplot MatricesVector Data: PredictionLinear RegressionThe Linear Regression ProblemExample of House PriceIllustrationFormalizationA 3step ProcessLeast Square EstimationOrdinary Least Squares (OLS)Gradient DescentBatch Gradient DescentStochastic Gradient DescentProbabilistic Interpretation Probabilistic InterpretationOther Practical IssuesOther Practical IssuesVector Data: PredictionModel Selection ProblemBias and VarianceBiasVariance TradeoffExample: degree d in regressionExample: regularization term in regressionCrossValidationAIC & BIC*Stepwise Feature SelectionVector Data: PredictionSummary