
Minnesota Casualty Actuarial Symposium

Machine Learning: What we learned from our first Coursera course
Nathan Hubbell, Laura Johnson, Patrick Fillmore, Stephen Segroves

January 22nd, 2013


Agenda

1. MOOC Overview - Nathan

2. Machine Learning Concepts - Patrick

3. Machine Learning in Practice - Stephen

4. Other Learnings from Machine Learning - Laura

5. Q&A

MOOC Overview
Nathan Hubbell


MOOC Overview

• MOOC – Massive Open Online Course

• Big MOOC Names
  – Feb, 2011: Udacity (Stanford)
  – April, 2012: Coursera (Stanford)
  – March, 2012: edX (Harvard, MIT, Berkeley)
  – Sept, 2006: Khan Academy

• Features
  – Open access
  – Scalability
  – Discussion Boards

http://en.wikipedia.org/wiki/Massive_open_online_course

“Coursera doubles university count to 33, now hosts over 200 courses for over 1.3 million students”
- The Next Web Insider, September 19th, 2012

MOOCs in the News

• On Udacity’s site:
  – “… But that seems to be a willful misreading of the regulation (which seems silly in the first place). Coursera isn't a degree mill. It's not about earning the degree, it's about actually learning. Minnesota's interpretation of the law is fairly ridiculous. It basically means that anyone who wants to access online educational material in Minnesota is limited by the state determining what it considers okay."

• Slate.com: Larry Pogemiller, director of the MN Office of Higher Education:
  – “Obviously, our office encourages lifelong learning and wants Minnesotans to take advantage of educational materials available on the Internet, particularly if they’re free. No Minnesotan should hesitate to take advantage of free, online offerings from Coursera.”


MOOCsperience

• Class Structure
  – 10 Week Course
  – 2-3 hours of video content per week
  – Wiki-based Course Notes
  – Questions? Discussion Forum

• Homework
  – Review Questions: Quick 5-question / 10 minutes
  – Programming Exercises: 1 – 4 hours


The following slides’ content is drawn heavily from the Coursera Machine Learning class content: https://www.coursera.org/#course/ml

Machine Learning Concepts
Patrick Fillmore


What is Machine Learning?

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

- Tom Mitchell, American computer scientist and E. Fredkin University Professor at Carnegie Mellon University

Task / Experience / Performance

What is Machine Learning?

             Task                         Experience                 Performance
Ratemaking   Predict Future Losses        Policy Loss History        Actual Losses / Loss Ratio
Reserving    Predict Future Development   Loss Development History   Final / Predicted Ultimates


Machine Learning Techniques

• Familiar
  – Linear Regression / Linear Models
  – Logistic Regression / GLMs

• Not Machine Learning Algorithms
  – Judgmental selection of LDFs
  – Risk/Reinsurance Models

• Unfamiliar
  – Supervised Learning
    – Regularization
    – Neural Networks
    – Support Vector Machines
  – Unsupervised Learning
    – Principal Component Analysis
    – Clustering
    – Recommender Systems
  – Many More!

Data Driven Modeling

Linear Regression

Weight = Height * Factor + Intercept

[Scatter plot: Human Height vs. Weight - Height (Inches) on the x-axis, Weight (Pounds) on the y-axis]

Hypothesis: y = h_θ(x) = θ₀ + θ₁x

Linear Regression: Cost Function

Hypothesis: h_θ(x) = θ₀ + θ₁x

[Scatter plot: Human Height vs. Weight - Height (Inches) vs. Weight (Pounds)]

How to find a good fit? Cost Function!

Cost Function:

J(θ₀, θ₁) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) - y⁽ⁱ⁾)² = SSE / (2m)

Fitting Goal: minimize J

Use Gradient Descent!
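The hypothesis and cost function above can be sketched in a few lines of Python. The height/weight pairs below are illustrative stand-ins, not the data behind the slide's chart; the intercept and slope are chosen so the example line fits them exactly.

```python
def h(theta0, theta1, x):
    """Hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) = SSE / (2m)."""
    m = len(xs)
    sse = sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * m)

# Illustrative height (inches) / weight (pounds) pairs -- not the slide's data
heights = [63.0, 66.0, 69.0, 72.0, 75.0]
weights = [110.0, 125.0, 140.0, 155.0, 170.0]

print(cost(-205.0, 5.0, heights, weights))  # this line fits exactly, so J = 0.0
```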

Linear Regression: Minimize Cost (Gradient Descent)

[Plot: Cost Function J vs. Theta]

1. Start with a θ; determine cost

Iteration    θ        J(θ)
    0       1.00   5,488,884

Linear Regression: Minimize Cost (Gradient Descent)

2. Determine how J changes with θ (dJ/dθ)

Iteration    θ        J(θ)       dJ/dθ
    0       1.00   5,488,884    -165.30

Linear Regression: Minimize Cost (Gradient Descent)

3. Calculate a new θ:  θ_new = θ_old - α · dJ/dθ,  α = learning rate = 0.01

Iteration    θ        J(θ)       dJ/dθ
    0       1.00   5,488,884    -165.30
    1       2.65

Linear Regression: Minimize Cost (Gradient Descent)

4. Iterate until Convergence

Iteration    θ        J(θ)       dJ/dθ
    0       1.00   5,488,884    -165.30
    1       2.65     581,450     -52.98
    2       3.18      77,351     -16.98
    3       3.35      25,569      -5.44
    4       3.41      20,250      -1.74
    5       3.42      19,704      -0.56
    6       3.43

[Plot: Cost Function J vs. Theta, with the iterates descending toward the minimum]

Final θ: 3.43
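The four-step loop on these slides can be written out directly. The slide's underlying data isn't reproduced here, so this is a sketch on made-up points for a one-parameter model h(x) = θ·x (matching the single-θ table above); the learning rate and iteration count are tuned to these toy numbers, not taken from the slide.

```python
def gradient_descent(xs, ys, theta=1.0, alpha=0.01, iters=500):
    """One-parameter gradient descent for the model h(x) = theta * x.
    Each pass computes dJ/dtheta and steps: theta_new = theta_old - alpha * dJ/dtheta."""
    m = len(xs)
    for _ in range(iters):
        # dJ/dtheta for J = (1/2m) * sum((theta*x - y)^2)
        grad = sum((theta * x - y) * x for x, y in zip(xs, ys)) / m
        theta -= alpha * grad
    return theta

# Toy data generated from y = 2x, so the iterates should converge to theta = 2
print(round(gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4))  # 2.0
```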

Cost Function – One Parameter

[Plot: Cost Function J vs. Theta, zoomed in near the minimum (θ between 3 and 4)]

Cost Function – Two Parameters

Linear Regression: GD vs. Normal Equations

[Scatter plot: Human Height vs. Weight with fitted line y = 3.4327x - 106.03]
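As the slide title suggests, the same line can be found in closed form, without iterating. A minimal sketch of the normal equations specialized to one feature; the data points below are made up for illustration, not the slide's height/weight sample.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept
    (the normal equations specialized to a single feature)."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Points that lie exactly on y = 3x - 5
print(fit_line([1.0, 2.0, 3.0, 4.0], [-2.0, 1.0, 4.0, 7.0]))  # (3.0, -5.0)
```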


Why discuss Gradient Descent at all?

• Basic fitting algorithm for Machine Learning
• Many other Systems/Models use Gradient Descent

Andrew Ng: If you understand gradient descent and can implement it, you can use optimized software to solve problems, and you are ahead of many of the people working in this field.


Neural Networks

Layer 1 Layer 2 Layer 3 Layer 4

Selecting Model Structure (The Right Machine for the Job)

Bias/variance: How would you fit this model?

[Scatter plot: housing Price vs. Size]


Bias vs. Variance

High bias (underfit): [Price vs. Size plot]
High variance (overfit): [Price vs. Size plot]
“Just right”: [Price vs. Size plot]
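One way to see the bias side of this trade-off numerically: on data with a clear trend, an underfit model leaves a large training error that a better-structured model removes. The "price vs. size" points below are hypothetical, not taken from the slides.

```python
def mse(preds, ys):
    """Mean squared error between predictions and actuals."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

# Hypothetical, linearly trending price-vs-size data
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [100.0, 150.0, 200.0, 250.0, 300.0]

# High bias: a constant model (the average price) ignores size entirely
mean_price = sum(prices) / len(prices)
underfit_error = mse([mean_price] * len(sizes), prices)

# A model whose structure matches the data: a straight line through the trend
line_error = mse([50.0 * s + 50.0 for s in sizes], prices)

print(underfit_error, line_error)  # 5000.0 0.0
```

The overfit side shows up the other way around: a model flexible enough to drive training error to zero on noisy data will typically do worse on held-out points, which is what the validation set in the next slides is for.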

Cross Validation

DATA → Model

Cross Validation

Data split: Training | Validation | Testing (Holdout)

Training → Model Fit
Validation → Model Structure
Testing (Holdout) → Final Model
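A sketch of the three-way split shown above. The 60/20/20 fractions are an assumed convention for illustration, not quoted from the slide.

```python
import random

def three_way_split(rows, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle the data, then split it into training / validation / testing sets.
    Training fits each candidate model, validation picks the model structure,
    and the holdout is used once at the end to estimate final performance."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * frac_train)
    n_val = int(n * frac_val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = three_way_split(range(10))
print(len(train), len(val), len(test))  # 6 2 2
```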


Regularization

High variance (overfit): [Price vs. Size plot]
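Regularization tackles the overfit case by charging the cost function for large parameters. Below is a sketch of an L2-penalized (ridge-style) cost in the spirit of the course's regularized cost function; the polynomial hypothesis and the test values are illustrative choices, not from the slides.

```python
def regularized_cost(theta, xs, ys, lam):
    """Squared-error cost plus an L2 penalty lam * sum(theta_j^2) on the
    non-intercept parameters, which shrinks them toward zero and trades a
    little bias for lower variance."""
    m = len(xs)
    # theta[j] multiplies x**j, so theta[0] is the intercept
    preds = [sum(t * x ** j for j, t in enumerate(theta)) for x in xs]
    sse = sum((p - y) ** 2 for p, y in zip(preds, ys))
    penalty = lam * sum(t ** 2 for t in theta[1:])  # intercept not penalized
    return (sse + penalty) / (2 * m)

# With lam = 0 this is the plain cost; raising lam charges for big coefficients
print(regularized_cost([0.0, 2.0], [1.0, 2.0], [2.0, 4.0], 0.0))  # 0.0
print(regularized_cost([0.0, 2.0], [1.0, 2.0], [2.0, 4.0], 1.0))  # 1.0
```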

Machine Learning in Practice: Cluster Analysis
Stephen Segroves

Supervised Learning. Training set: labeled examples (x, y).

Unsupervised Learning. Training set: unlabeled examples (x only).

Applications of Clustering

Market Segmentation / Customer Profiling

Territory Grouping

Social Network Analysis

Clustering: K-Means Algorithm


Randomly initialize K cluster centroids μ₁, …, μ_K
Repeat {
    for i = 1 to m:
        c⁽ⁱ⁾ := index (from 1 to K) of cluster centroid closest to x⁽ⁱ⁾
    for k = 1 to K:
        μ_k := average (mean) of points assigned to cluster k
}
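The loop above translates almost line-for-line into Python. A sketch on 2-D points; the cluster count, iteration cap, seed, and blob data are illustrative choices.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """K-means: repeat (1) assign each point to its closest centroid,
    (2) move each centroid to the mean of the points assigned to it."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: c_i = index of nearest centroid
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, cl in enumerate(clusters):  # update step: centroid = cluster mean
            if cl:  # keep the old centroid if a cluster ends up empty
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# Two well-separated blobs; the centroids should land near (1/3, 1/3) and (31/3, 31/3)
blobs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(blobs, 2)))
```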


Potential Issues: Local Optima

Potential Solution: Local Optima

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c⁽¹⁾, …, c⁽ᵐ⁾, μ₁, …, μ_K.
    Compute cost function (distortion) J.
}

Pick clustering that gave lowest cost
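The restart loop above can be sketched as follows. To stay self-contained it bundles a minimal 1-D K-means; the data, run count, and seed are illustrative.

```python
import random

def kmeans_1d(xs, k, rng, iters=20):
    """Minimal 1-D K-means with a random initialization drawn from the data."""
    centroids = rng.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:  # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda j: (x - centroids[j]) ** 2)].append(x)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def distortion(xs, centroids):
    """Cost function: average squared distance to the closest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in xs) / len(xs)

def best_of_restarts(xs, k, runs=100, seed=0):
    """Run K-means many times from different random initializations and
    keep the clustering with the lowest distortion."""
    rng = random.Random(seed)
    candidates = [kmeans_1d(xs, k, rng) for _ in range(runs)]
    return min(candidates, key=lambda c: distortion(xs, c))

# Three obvious 1-D clusters; the restarts should settle near centers 1, 11, 21
data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
print(sorted(best_of_restarts(data, 3)))
```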

Other Learnings from the Machine Learning Course
Laura Johnson

This course was a great way to learn – WHY?

• Structure and foundation given
  – 58,000 students across the world across multiple disciplines
• Well laid out web site
• Discussion forums, wikis, etc.
  – Basic building blocks provided
• Technical enhancements to recorded sessions
  – Notes – color!!
  – Captions / transcript
  – Speed control
  – “Interactive” feedback

Coursera Look and Feel - Structure

Coursera – Teaching using Building Blocks

Coursera Technical Enhancements – Notes, Captions, Speed


[Screenshot: the earlier Neural Networks slide (Layers 1-4), shown as an example of annotated course video]

Coursera Technical Enhancements – Notes in Color!!


Coursera Technical Enhancements - Feedback

Machine Learning MOOC Recommendations

• Time
  – Only take one MOOC at a time!
  – Do the homework on time

• Software Required
  – Google Chrome, Firefox, IE9
  – Octave (free MATLAB alternative)
  – Text Editor (UltraEdit, SublimeText, TextWrangler)

• Suggested Prerequisites
  – Linear Algebra
  – Some Programming Experience a Plus

• Team up!

• Final comments on Machine Learning:
  – Data: GIGO (garbage in, garbage out)
  – Half science / half art


Questions?