Interacting with Data


Page 1: Interacting with Data

1

Interacting with Data

Materials from a Course at Princeton University

-- Hu Yan

Page 2: Interacting with Data

2

Outline

Introduction to this course
Introduction to Classification
The Nearest Neighbor Algorithm
Decision Tree Algorithm
Conclusion and future talks

Page 3: Interacting with Data

3

What is this course about? This course is about data!

How to get the most out of data and convert it into knowledge, information, or predictions.

Examples of datasets:
Credit cards: every purchase you make is tracked; the data are used to detect fraud, for marketing purposes, and to make predictions.
Security cameras: used for tracking (e.g., enforcing fines) or for finding criminals via facial recognition software.
Articles: articles are indexed in multiple databases; we can organize articles by topic or even track the evolution of topics over time.

There are all kinds of data: text, images, transaction records, etc.

Page 4: Interacting with Data

4

Tasks
Make predictions or classifications: classify customers by whether or not they will switch companies.
Cluster or organize data: cluster articles by topic. This differs from classification: the classes are not known ahead of time.
Find "simple" descriptions of complex objects: find a simple description of faces.
Identify what is typical and what is an outlier: identify purchases that are typical or unusual for a given customer.

Page 5: Interacting with Data

5

Perspective: related fields
Pattern recognition (from the 60s): primarily concerned with images.
Machine learning (from the 80s): a natural outgrowth of Artificial Intelligence (AI).
Data mining (from the 90s): developed to deal with vast amounts of data and discover "interesting patterns".

This course is largely a mixture of statistics, machine learning, and data mining.
It looks at interacting with data through classification, clustering, regression, and dimensionality reduction.

Page 6: Interacting with Data

6

Outline

Introduction to this course
Introduction to Classification
The Nearest Neighbor Algorithm
Decision Tree Algorithm
Conclusion and future talks

Page 7: Interacting with Data

7

Introduction to Classification
Classification: assigning objects from a data set to classes based on certain characteristics. Binary classification: positive or negative.

Classification learning algorithm
Input: a labeled data set
Output: a classifier (predicts the label of unclassified input examples)

Page 8: Interacting with Data

8

Example

classification criterion:

any integer greater than 196 or less than 47 will be labeled negative, and positive otherwise.

Page 9: Interacting with Data

9

Example

a decimal integer is positive if the second and sixth most significant bits in its binary representation are set; it’s negative otherwise.
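
Both toy criteria can be written directly as classifiers. A minimal Python sketch; the bit width for the second criterion is not stated on the slide, so the fixed 8-bit width below is an assumption:

def classify_by_range(n):
    # Criterion from Page 8: negative if n > 196 or n < 47, positive otherwise.
    return "negative" if n > 196 or n < 47 else "positive"

def classify_by_bits(n, width=8):
    # Criterion from Page 9 (sketch): positive if the second and sixth most
    # significant bits are set. The slide does not fix the bit width; a
    # hypothetical width of 8 bits is assumed here.
    bits = format(n, "0{}b".format(width))
    return "positive" if bits[1] == "1" and bits[5] == "1" else "negative"

print(classify_by_range(100))        # positive (47 <= 100 <= 196)
print(classify_by_bits(0b01000100))  # positive (2nd and 6th bits from the left are set)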

Page 10: Interacting with Data

10

Outline

Introduction to this course
Introduction to Classification
The Nearest Neighbor Algorithm
Decision Tree Algorithm
Conclusion and future talks

Page 11: Interacting with Data

11

The Nearest Neighbor Algorithm

Training:
There are m training examples. Each training example is of the form (xi, yi), where xi ∈ R^n and yi ∈ {v1, …, vs}.
Store all the training examples.

Testing:
Given a test point x, predict yi, where xi is the training example closest to x.
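
A minimal sketch of this train/test procedure in Python, assuming Euclidean distance (the slide does not specify the distance measure):

import math

def nn_train(examples):
    # Training: simply store all (xi, yi) pairs.
    return list(examples)

def nn_predict(model, x):
    # Testing: return the label yi of the stored xi closest to the query x.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    xi, yi = min(model, key=lambda example: dist(example[0], x))
    return yi

model = nn_train([((1.0, 1.0), "+"), ((4.0, 4.0), "-")])
print(nn_predict(model, (1.5, 0.5)))   # "+": (1.0, 1.0) is the closest stored point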

Page 12: Interacting with Data

12

The Nearest Neighbor Algorithm is a kind of instance-based learning method, often referred to as a "lazy" learning method:
Simply store the training examples and delay processing until a new instance must be classified.
(In contrast, other methods construct a general, explicit description of the target function as soon as training examples are provided.)

Advantage: instead of estimating the target function once for the entire instance space, it is estimated locally and differently for each new instance.
Disadvantage: the cost of classifying new instances can be high.

Page 13: Interacting with Data

13

k-Nearest Neighbor Algorithm

Page 14: Interacting with Data

14

k-Nearest Neighbor Algorithm

Example in a two-dimensional space with positive and negative training examples: the query point xq is classified positive by the 1-nearest neighbor rule but negative by the 5-nearest neighbor rule.

Page 15: Interacting with Data

15

k-Nearest Neighbor Algorithm
Never forms an explicit general hypothesis f^ for the target function f; it simply computes the classification of each new query instance as needed.
What is the implicit general function?

Page 16: Interacting with Data

16

Distance-weighted k-Nearest Neighbor Algorithm
An obvious refinement: weight the contribution of each of the k neighbors according to its distance to the query point.
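
A minimal sketch of this refinement, again assuming Euclidean distance and using the common 1/d^2 weighting (the slide only says contributions are weighted by distance, not how):

import math
from collections import defaultdict

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def weighted_knn_predict(examples, x, k=5):
    # Take the k nearest stored examples and let each vote with weight 1/d^2.
    neighbors = sorted(examples, key=lambda ex: dist(ex[0], x))[:k]
    votes = defaultdict(float)
    for xi, yi in neighbors:
        d = dist(xi, x)
        if d == 0.0:
            return yi              # the query coincides with a training point
        votes[yi] += 1.0 / d ** 2
    return max(votes, key=votes.get)

With this weighting, close neighbors count for more than distant ones, so a few far-away points cannot outvote a tight cluster of nearby examples.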

Page 17: Interacting with Data

17

Curse of dimensionality

Imagine instances described by 20 attributes, of which only 2 are relevant to the target function.
Curse of dimensionality: nearest neighbor is easily misled in high-dimensional spaces.

One approach: stretch the j-th axis by a weight zj, where z1, …, zn are chosen to minimize prediction error.
Use cross-validation to choose the weights z1, …, zn automatically.
Note that setting zj to zero eliminates that dimension altogether.
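
A minimal sketch of the axis-stretching idea, with a hand-picked weight vector z (the slide suggests choosing z by cross-validation, which is omitted here):

def stretched_distance(a, b, z):
    # Squared distance with the j-th axis stretched by weight z[j];
    # z[j] = 0 eliminates that dimension altogether.
    return sum((zj * (aj - bj)) ** 2 for aj, bj, zj in zip(a, b, z))

# 20 attributes, only the first 2 relevant: zero out the other 18 axes.
z = [1.0, 1.0] + [0.0] * 18
a = [1.0] * 20
b = [1.0, 2.0] + [9.0] * 18
print(stretched_distance(a, b, z))   # 1.0: the 18 irrelevant attributes no longer mislead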

Page 18: Interacting with Data

18

Outline

Introduction to this course
Introduction to Classification
The Nearest Neighbor Algorithm
Decision Tree Algorithm
  Decision tree representation
  ID3 learning algorithm
  Entropy, information gain
  Overfitting
Conclusion and future talks

Page 19: Interacting with Data

19

Decision tree for PlayTennis

Page 20: Interacting with Data

20

Decision tree representation
Instances are represented by attribute-value pairs.
Each internal node tests an attribute.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
In general, a decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances, as illustrated by the sketch below.
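
A minimal sketch of one possible in-memory representation, using the PlayTennis tree of Page 19. The Sunny branch matches the rules shown later under Rule Post Pruning; the Overcast and Rain branches below are the usual ones for this example and are an assumption here:

# Internal nodes test an attribute, branches carry attribute values,
# and leaves assign the classification.
play_tennis_tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny":    {"attribute": "Humidity",
                     "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"attribute": "Wind",
                     "branches": {"Strong": "No", "Weak": "Yes"}},
    },
}

def classify(tree, instance):
    # Walk from the root: test the node's attribute, follow the matching branch.
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

print(classify(play_tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))   # Yes

Read as a disjunction of conjunctions, this tree says PlayTennis = Yes iff (Outlook = Sunny AND Humidity = Normal) OR (Outlook = Overcast) OR (Outlook = Rain AND Wind = Weak).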

Page 21: Interacting with Data

21

Building a Decision Tree
ID3 (1986), C4.5 (1993): a top-down, greedy search through the space of possible decision trees.

Main loop:
A ← the best decision attribute for the next node;
Assign A as the decision attribute for the node;
For each value of A, create a new branch of the node;
Sort the training examples to the leaf nodes;
If the training examples are perfectly classified, then STOP;
Else iterate over the new leaf nodes.
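
A minimal Python sketch of this main loop (recursive rather than iterative, and using the information gain measure defined on the following slides; a sketch, not the course's exact algorithm):

import math
from collections import Counter

def entropy(labels):
    # Impurity of a list of class labels, in bits.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(examples, labels, attr):
    # Entropy reduction obtained by splitting the examples on attr.
    gain = entropy(labels)
    for v in set(ex[attr] for ex in examples):
        subset = [y for ex, y in zip(examples, labels) if ex[attr] == v]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

def id3(examples, labels, attributes):
    # Top-down greedy construction: pick the best attribute, branch on its
    # values, and recurse until the examples are perfectly classified.
    if len(set(labels)) == 1:
        return labels[0]                                  # perfectly classified: STOP
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # fall back to majority label
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    node = {"attribute": best, "branches": {}}
    for v in set(ex[best] for ex in examples):            # one branch per value of best
        branch = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == v]
        sub_ex, sub_y = [ex for ex, _ in branch], [y for _, y in branch]
        remaining = [a for a in attributes if a != best]
        node["branches"][v] = id3(sub_ex, sub_y, remaining)
    return node

Examples are assumed to be dictionaries mapping attribute names to values, matching the representation sketch on the previous slide.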

Page 22: Interacting with Data

22

Entropy
S is a sample of training examples.
p+ is the proportion of positive examples in S.
p- is the proportion of negative examples in S.
Entropy measures the impurity of S:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

Here, Entropy([9+,5-]) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940

In general, for c classes: Entropy(S) = sum over i = 1..c of -pi log2(pi)
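
A small Python check of the two-class formula, reproducing the 0.940 above:

import math

def entropy(pos, neg):
    # Entropy(S) for a sample with pos positive and neg negative examples;
    # a zero count contributes 0 (0 * log2(0) is taken to be 0).
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c)

print(round(entropy(9, 5), 3))   # 0.94, i.e. Entropy([9+, 5-]) ≈ 0.940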

Page 23: Interacting with Data

23

Entropy function

Entropy(S) = expected number of bits needed to encode class (+ or -) of randomly drawn member of S (under the optimal shortest-length code)

Entropy(S) = -p+ log2(p+) - p- log2(p-)

Page 24: Interacting with Data

24

Information Gain

Page 25: Interacting with Data

25

Information Gain

S is a collection of training-example days described by attributes including Wind, which has the values Weak and Strong.

S contains 14 examples, [9+, 5-].
6 of the positive and 2 of the negative examples have Wind = Weak; the remainder have Wind = Strong.
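
Plugging these counts into the standard ID3 definition, Gain(S, A) = Entropy(S) - sum over values v of A of (|Sv| / |S|) * Entropy(Sv), gives the Wind value that reappears on a later slide. A small check:

import math

def entropy(pos, neg):
    total = pos + neg
    return sum(-(c / total) * math.log2(c / total) for c in (pos, neg) if c)

# S = [9+, 5-]; Wind = Weak covers [6+, 2-], Wind = Strong covers [3+, 3-].
gain_wind = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain_wind, 3))   # 0.048, matching Gain(S, Wind) on the later slide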

Page 26: Interacting with Data

26

Training Examples

Page 27: Interacting with Data

27

Which Attribute Is the Best Classifier?
Information gain is the measure used by ID3 to select the best attribute at each step in growing the tree.
Example: the information gain of two attributes, Humidity and Wind, is computed to determine which is better for classifying the training examples.

Page 28: Interacting with Data

28

An Illustrative Example
Gain(S, Outlook) = 0.245
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temp) = 0.029

Which attribute should be tested here?

Page 29: Interacting with Data

29

Selecting the Next Attribute

Ssunny = {D1, D2, D8, D9, D11}

Gain(Ssunny, Humidity) = 0.970
Gain(Ssunny, Temp) = 0.570
Gain(Ssunny, Wind) = 0.019

[Figure: the partially grown tree, with Outlook at the root splitting the days into {1,2,8,9,11}, {3,7,12,13}, and {4,5,6,10,14}.]

Page 30: Interacting with Data

30

Hypothesis Space Search by ID3

ID3 searches through the space of possible decision trees from simple to increasingly complex, guided by the information gain measure Gain(S, A).

Page 31: Interacting with Data

31

Hypothesis Space Search by ID3
ID3 searches a complete hypothesis space, but it searches that space incompletely.
Outputs a single hypothesis.
No backtracking: it converges to a locally optimal solution (maybe not the global optimum).
Uses statistical properties of all the examples, so it is robust to noisy data.
Inductive bias: a preference for short trees and for those that place high-information-gain attributes near the root.

Page 32: Interacting with Data

32

Day Temp Humidity Wind Play

D1 Cool High Weak No

D2 Cool Normal Weak No

D3 Hot High Strong No

D4 Hot Normal Weak Yes

D5 Cool Normal Strong Yes

[Figure: candidate decision trees for these five examples, splitting on Humidity, Temp, and Wind, with the class counts at each node.]

Entropy(S) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971

Gain(S, Humidity) = 0.405

Gain(S, Wind) = Gain(S, Temp) = 0.805

Page 33: Interacting with Data

33

Overfitting in Decision Tree Learning
Consider the error of a hypothesis h over:
the training data: error_train(h)
the entire distribution D of data: error_D(h)

A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that

error_train(h) < error_train(h')  AND  error_D(h) > error_D(h')

Page 34: Interacting with Data

34

Overfitting in Decision Tree Learning

Page 35: Interacting with Data

35

Avoiding Overfitting
How can we avoid overfitting?
Stop growing the tree before it reaches the point where it perfectly classifies the training data.
Grow the full tree, then post-prune (widely used).

How do we select the best tree during pruning?
Split the data into a training set and a validation set.
Build the decision tree over the training data.
Measure performance over the separate validation data set.

Two ways of pruning: reduced-error pruning and rule post pruning.

Page 36: Interacting with Data

36

Reduced Error Pruning

1. Split the data into a training set and a validation set.

2. Build the tree over the training data.

3. Repeatedly: evaluate the impact on the validation set of pruning each decision node, and remove the node whose removal most improves validation set accuracy.

Pruning a decision node means removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
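
A minimal sketch of this procedure, assuming trees are represented as in the earlier sketches and that each decision node additionally stores, under a hypothetical "majority" key, the most common classification of the training examples affiliated with it:

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree

def accuracy(tree, validation):
    # Fraction of (instance, label) pairs in the validation set classified correctly.
    return sum(classify(tree, x) == y for x, y in validation) / len(validation)

def decision_nodes(tree, parent=None, value=None):
    # Yield (parent node, branch value, node) for every internal node of the tree.
    if isinstance(tree, dict):
        yield parent, value, tree
        for v, child in tree["branches"].items():
            yield from decision_nodes(child, tree, v)

def reduced_error_prune(tree, validation):
    improved = True
    while improved:
        improved = False
        best_acc, best_site = accuracy(tree, validation), None
        for parent, value, node in decision_nodes(tree):
            leaf = node["majority"]        # most common training label at this node
            if parent is None:             # candidate: prune the whole tree to a leaf
                candidate_acc = accuracy(leaf, validation)
            else:                          # candidate: temporarily replace this subtree
                parent["branches"][value] = leaf
                candidate_acc = accuracy(tree, validation)
                parent["branches"][value] = node          # undo the trial prune
            if candidate_acc > best_acc:
                best_acc, best_site = candidate_acc, (parent, value, leaf)
        if best_site is not None:          # keep only the single best improving prune
            parent, value, leaf = best_site
            if parent is None:
                tree = leaf
            else:
                parent["branches"][value] = leaf
            improved = True
    return tree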

Page 37: Interacting with Data

37

Effect of Reduced-Error Pruning

Page 38: Interacting with Data

38

Rule Post Pruning
1. Convert the tree to an equivalent set of rules (if-then expressions).
2. Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
3. Sort the pruned rules by their estimated accuracy (this order can be used when classifying subsequent instances).

IF (Outlook = Sunny) AND (Humidity = High) THEN PlayTennis = No

IF (Outlook = Sunny) AND (Humidity = Normal) THEN PlayTennis = Yes

…

Page 39: Interacting with Data

39

Conclusion
Interacting with data: how to get the most out of data and convert it into knowledge, information, or predictions.
Classification, clustering, regression, and dimensionality reduction.

Classification: categorize objects into particular classes based on their attributes.
The Nearest Neighbor Algorithm
Decision Tree Algorithm

Page 40: Interacting with Data

40

Contents
Classification: k-nearest-neighbor algorithm, decision trees, computational learning theory, boosting, support vector machines
Clustering: k-means clustering, agglomerative clustering
Graphical Models (a marriage of probability theory and graph theory): Naive Bayes classification, the EM (Expectation-Maximization) algorithm
Regression (predict a real-valued quantity based on observed data): linear regression, logistic regression
Dimensionality Reduction (reduce the representation of data): PCA (Principal Components Analysis), factor analysis
Advanced Topics and Applications

Page 41: Interacting with Data

41

Thank you!