2017 Predictive Analytics Symposium
Session 13, Getting Started: Sources of Tools and Training
Moderator: Min Mercer, FSA, MAAA
Presenters:
Mary Pat Campbell, FSA, MAAA
Michael Cletus Niemerg, FSA, MAAA
SOA Antitrust Compliance Guidelines SOA Presentation Disclaimer
www.conning.com | © 2017 Conning, Inc.
Getting Started: Sources of Tools and Training
Resources for Predictive Analytics in R
Mary Pat Campbell, FSA, MAAA
September 2017
© 2017 Conning, Inc. This research publication is copyrighted with all rights reserved. No part of this research publication may be reproduced, transcribed, transmitted, stored in an electronic retrieval system, or translated into any language in any form by any means without the prior written permission of Conning.
PREDICTIVE ANALYTICS IN R
Beginner Resources
INTRO TO STATISTICAL LEARNING
Online course, textbook, and R exercises
Online Course – Hands On, Simple Examples
Class link: https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/
Stanford Online Lagunita: Statistical Learning – Self-Paced Course
Multimedia Approach, Convenient to Try Offline
1.1 Opening Remarks
Quick Quizzes to Test Understanding
Chapter 7 Quiz: Moving Beyond Linearity
Instant Feedback and Explanation
Chapter 7 Quiz: Moving Beyond Linearity
Some Questions are Easier than Others...
Chapter 4.1: Introduction to Classification Problems
Textbook: An Introduction to Statistical Learning with Applications in R
Book Page:
http://www-bcf.usc.edu/~gareth/ISL/book.html
Fourth Printing:
http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Fourth%20Printing.pdf
Simplified mathematical/statistical underpinning
More rigorous approach: The Elements of Statistical Learning
Covers same material as in the course
Exercises: Important to Try Both
  Conceptual
  Applied
Conceptual Exercises: Curse of Dimensionality
ISLR, Chapter 4, Conceptual Exercises -- #4 – the problem with local approaches
Conceptual Exercises: Backward Solving for Sample
Curse of Dimensionality: The Ever-Expanding Cube
Wolfram Alpha query: https://www.wolframalpha.com/input/?i=lim(10%5E(-1%2Fd),+d+-%3E+infinity)&rawformassumption=%7B%22C%22,+%22d%22%7D+-%3E+%7B%22Variable%22%7D&rawformassumption=%22UnitClash%22+-%3E+%7B%22d%22,+%7B%22Days%22%7D%7D
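The limit in the Wolfram Alpha query can also be checked numerically. The sketch below assumes the setup of ISLR Chapter 4, Conceptual Exercise 4: features spread uniformly over a unit hypercube, and a local method that wants to use 10% of the observations for each prediction. Under that assumption the neighborhood cube needs side length 0.1^(1/d) = 10^(-1/d) in d dimensions. The variable names are illustrative and not from the slide.

```r
# Side length of a hypercube that captures 10% of uniformly distributed
# observations in d dimensions: side^d = 0.1, so side = 0.1^(1/d).
dims <- c(1, 2, 5, 10, 100)
side <- 0.1^(1 / dims)

# For comparison: the fraction of observations captured if every feature
# is restricted to 10% of its range, which is 0.1^d.
frac <- 0.1^dims

print(data.frame(d = dims, side_length = round(side, 4), fraction = frac))
```

At d = 100 the side length is already about 0.977, so a "local" neighborhood spans nearly the full range of every feature, which is the expanding cube the slide title refers to.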
KAGGLE KERNELS
Online playground
Kaggle Kernels: Play With Other People’s Code!
Random Forest And Nearest Neighbors on a Few Blocks
Alexandru Papiu, https://www.kaggle.com/apapiu/random-forest-on-a-few-blocks Accessed 11 Sept 2017
Can See Code and Its Result: Graph Example
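library(ggplot2)  # small_train is the kernel's training subset (not shown on the slide)
ggplot(small_train, aes(x, y)) +
  geom_point(aes(color = place_id)) +       # one point per check-in, colored by place
  theme_minimal() +
  theme(legend.position = "none") +         # hide the legend: too many place_id values
  ggtitle("Check-ins colored by place_id")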
Alexandru Papiu, https://www.kaggle.com/apapiu/random-forest-on-a-few-blocks Accessed 11 Sept 2017
ggplot cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Can See Code and Its Result: Non-Graph Example
Code:
model_knn = FNN::knn(train = X, test = X_val, cl = small_train$place_id, k = 15)
preds <- as.character(model_knn)
truth <- as.character(small_val$place_id)
mean(truth == preds)
Result:
## [1] 0.5151964
Code:
set.seed(131L)
small_train$place_id <- as.factor(small_train$place_id)  # ranger needs factors for classification
model_rf <- ranger(place_id ~ x + y + accuracy + hour + weekday + month + year,
                   small_train, num.trees = 100, write.forest = TRUE, importance = "impurity")
Result:
## Growing trees.. Progress: 36%. Estimated remaining time: 55 seconds.
## Growing trees.. Progress: 86%. Estimated remaining time: 10 seconds.
Alexandru Papiu, https://www.kaggle.com/apapiu/random-forest-on-a-few-blocks Accessed 11 Sept 2017
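For readers who want to see what the nearest-neighbor step on this slide is doing, here is a minimal sketch in base R. It is not the kernel's code: `knn_predict` is a hypothetical hand-rolled helper standing in for `FNN::knn`, and R's built-in `iris` data stands in for the check-in data, which is not reproduced here. Only the last three lines mirror the accuracy calculation shown on the slide.

```r
# Minimal k-nearest-neighbors classifier in base R (illustrative only).
knn_predict <- function(train, test, cl, k) {
  apply(test, 1, function(pt) {
    # Euclidean distance from this test point to every training row.
    d <- sqrt(rowSums(sweep(train, 2, pt)^2))
    nearest <- cl[order(d)[seq_len(k)]]
    # Majority vote among the k nearest labels.
    names(which.max(table(nearest)))
  })
}

# Assumed stand-in data: 100 training rows, 50 validation rows from iris.
set.seed(1)
idx   <- sample(nrow(iris), 100)
X     <- as.matrix(iris[idx, 1:4])
X_val <- as.matrix(iris[-idx, 1:4])

preds <- knn_predict(X, X_val, cl = as.character(iris$Species[idx]), k = 15)
truth <- as.character(iris$Species[-idx])
accuracy <- mean(truth == preds)  # share of validation rows classified correctly
print(accuracy)
```

The accuracy here is not comparable to the 0.5151964 on the slide, since the data and the number of classes are different.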
Other Kaggle Kernels to Try in R
Exploring the Titanic Dataset
  https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
  Great data set for beginners – passenger list on Titanic... and who survived
  Yes/no classification problem
  Beginners competition: https://www.kaggle.com/c/titanic
Exploratory Analysis Zillow
  https://www.kaggle.com/philippsp/exploratory-analysis-zillow
  shows correlation plots
Wiki Traffic Forecast Exploration
  https://www.kaggle.com/headsortails/wiki-traffic-forecast-exploration-wtf-eda
  has hidden code – unhide to see
Titanic Survival Data Set
SOURCE: Megan Risdal, https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic
Fork that Script!
See How It’s Done
SOURCE: Philipp Spachtholz, https://www.kaggle.com/philippsp/exploratory-analysis-zillow Accessed 12 Sept 2017
GOT YOUR OWN RECOMMENDATIONS?
Getting Started: Sources of Tools and Training
Session 13
September 2017 – Predictive Analytics Symposium
Michael Niemerg, FSA, MAAA
Healthcare Actuary, Milliman Inc.
Machine Learning Toolbox
Statistics and Algorithms
Programming
Data wrangling
Data Visualization and Communication
Domain Knowledge
What’s Common in the Insurance Industry
Languages
  R
  Python
  SAS
Algorithms
  Generalized Linear Models
  Penalized Regression
  Decision Trees
What I wished I had known…
Don’t be intimidated
  Theory is deep – application is shallow
  You don’t need an advanced math/stats/CS degree
Just get started
  Start with a single programming language and a problem
  Don’t focus on only algorithms or only programming – do both
Spend your time wisely
  Focus on what matters to you
  Get the gist of what isn’t important to you
Know your goals
  Are you curious to see what the buzz is about?
  Do you need help with a specific problem?
  Do you want to move into a predictive modeling career?
Just get started already!
Online Education
www.coursera.org
www.datacamp.com
www.kaggle.com
www.udacity.com
www.machinelearningmastery.com
www.udemy.com
Useful Websites
https://dataelixir.com/
www.r-bloggers.com/
www.statsblogs.com/
www.win-vector.com/blog
www.datatau.com
www.kdnuggets.com
Personal Favorites
Book: An Introduction to Statistical Learning
Online Course: Machine Learning (Coursera – Andrew Ng)
Website: www.datatau.com
Blog/Newsletter: Data Elixir
Dataset Repository: Kaggle
Machine Learning in Action
Advantages
  Gentle on the math
  Good introduction to the core concepts
  First-principles code and examples
Disadvantages
  Requires some familiarity with Python
  Doesn’t teach pragmatic programming
Gradient Descent
A predictive model can have many possible parameter values.
Some parameters are better than others – how do we find the best values?
Algorithm
  Determine the learning rate
  Initialize the variable weights
  Repeat R times:
    Calculate the gradient of the error function
    Update the weights: Weights = Weights - Learning Rate * Gradient
Gradient Descent – Example
Goal – Using a vector of input data, find a single predicted value ($\hat{y}$) that minimizes our error function when compared to the data
Data – [107, 93, 105, 107, 95, 82, 110, 104, 99, 99]
Error Function: $\sum \frac{1}{2} (\text{Actual} - \text{Predicted})^2$
Gradient: $-\sum (\text{Actual} - \text{Predicted})$
Parameters:
  Start with an initial guess of 165
  Perform 5 rounds of gradient descent
  Use a learning rate of .05
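The example on this slide is small enough to run directly. The sketch below uses the data, initial guess, number of rounds, and learning rate given on the slide; the variable names are illustrative. Each update steps against the gradient, that is, the gradient of the error function is subtracted after scaling by the learning rate.

```r
# Data and settings taken from the slide.
actual        <- c(107, 93, 105, 107, 95, 82, 110, 104, 99, 99)
prediction    <- 165    # initial guess
learning_rate <- 0.05
rounds        <- 5

for (i in seq_len(rounds)) {
  # Gradient of sum(0.5 * (actual - prediction)^2) with respect to prediction.
  gradient   <- -sum(actual - prediction)
  # Step against the gradient to reduce the error.
  prediction <- prediction - learning_rate * gradient
  cat(sprintf("Round %d: gradient = %.4f, prediction = %.6f\n",
              i, gradient, prediction))
}

# The value that minimizes this error function is the sample mean.
cat("Sample mean:", mean(actual), "\n")
```

With these settings each round halves the gap to the sample mean of 100.1, because the learning rate times the number of observations is 0.5. The prediction moves from 165 to 132.55 after the first round and reaches about 102.13 after the fifth.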
Logistic Regression w/ Gradient Descent
Let’s start with a randomly generated dataset…
Sample of Data
Y   X1     X2
0   1.75   1.37
0   1.05   1.52
0   2.43   1.31
0   2.06   0.5
1   1.45   2.64
1   0.95   4.38
1   1.97   3.5
1   1.51   3.69
Category 0 = White
Category 1 = Black
Scatterplot of Full Dataset
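As a rough sketch of how gradient descent applies to logistic regression, the code below fits a model to the eight sample rows from this slide (Y, X1, X2 as read from the sample table). The full dataset is not reproduced here, and the zero starting weights, learning rate, number of rounds, and use of the mean log-loss gradient are all assumptions for illustration, not settings from the presentation.

```r
# Sample rows as read from the slide (Y = category, X1 and X2 = features).
y  <- c(0, 0, 0, 0, 1, 1, 1, 1)
x1 <- c(1.75, 1.05, 2.43, 2.06, 1.45, 0.95, 1.97, 1.51)
x2 <- c(1.37, 1.52, 1.31, 0.50, 2.64, 4.38, 3.50, 3.69)
X  <- cbind(intercept = 1, x1 = x1, x2 = x2)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Assumed settings (not from the slide).
weights       <- rep(0, ncol(X))
learning_rate <- 0.3
rounds        <- 10000

for (i in seq_len(rounds)) {
  p        <- sigmoid(as.vector(X %*% weights))        # predicted P(Y = 1)
  gradient <- as.vector(t(X) %*% (p - y)) / length(y)  # gradient of mean log-loss
  weights  <- weights - learning_rate * gradient       # step against the gradient
}

predicted <- as.integer(sigmoid(as.vector(X %*% weights)) > 0.5)
print(setNames(round(weights, 3), colnames(X)))
cat("Training accuracy:", mean(predicted == y), "\n")
```

These eight rows happen to be linearly separable, so the fitted weights keep growing as the number of rounds increases rather than settling at fixed values. On a larger, overlapping dataset the weights would converge.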