Exploratory Data Analysis with R

Page 5: Exploratory Data Analysis with R

The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it – that’s going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

Hal Varian, Google’s Chief Economist
The McKinsey Quarterly, Jan 2009

Page 7: Exploratory Data Analysis with R

Job Postings for Data Scientists

Page 8: Exploratory Data Analysis with R

Top-paying Tech Skills

[Table columns: Skill, 2016, Change]

Source: Dice Salary Survey 2017

Page 9: Exploratory Data Analysis with R

Data Science Tools

[Bar chart: Share of Respondents (0% to 70%) by Tool: language, platform, analytics]

Source: O’Reilly 2015 Data Science Salary Survey

Page 13: Exploratory Data Analysis with R

Overview

Introduction to R

Working with Data

Descriptive Statistics

Data Visualization

Beyond R and EDA

Page 14: Exploratory Data Analysis with R

Introduction to R

Page 15: Exploratory Data Analysis with R

What is R?

Open source

Language and environment

Numerical and graphical analysis

Cross platform

Page 16: Exploratory Data Analysis with R

What is R?

Active development

Large user community

Modular and extensible

9000+ extensions

and best of all…

Page 17: Exploratory Data Analysis with R

FREE

Page 19: Exploratory Data Analysis with R

Source: http://redmonk.com/sogrady/2016/07/20/language-rankings-6-16/

Page 23: Exploratory Data Analysis with R

Code Demo

Page 30: Exploratory Data Analysis with R

Working with Data

Data munging

Data wrangling

Data cleaning

Data cleansing

Page 35: Exploratory Data Analysis with R

Loading Data in R

CSV XML
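
Loading a CSV file can be sketched in base R as follows. The file contents and column names are made up for illustration, and the example writes a small temporary file so that it runs on its own. XML files typically need an add-on package such as xml2, which is not shown here.

```r
# Hypothetical sample data written to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Title,Year,Rating",
  "Gladiator,2000,R",
  "Dinosaur,2000,PG"
), path)

# read.csv() parses the file into a data frame; keep text columns as character.
movies <- read.csv(path, stringsAsFactors = FALSE)

str(movies)   # inspect column names and types
head(movies)  # peek at the first rows
```

Checking the result with str() right after loading is a cheap way to catch columns that were read with an unexpected type.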

Page 43: Exploratory Data Analysis with R

Cleaning Data

Reshape data

Rename columns

Convert data types

Ensure proper encoding

Ensure internal consistency

Handle errors and outliers

Handle missing values
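
A few of the cleaning steps listed above might look like this in base R. This is only a sketch: the data frame, its column names, and the rule that a score must lie between 0 and 100 are assumptions made for illustration.

```r
# Hypothetical raw data: a cryptic column name, numbers stored as text, and a gap.
raw <- data.frame(
  ttl   = c("Gladiator", "Dinosaur", "Unknown Film"),
  year  = c("2000", "2000", NA),
  score = c("76", "65", "120"),
  stringsAsFactors = FALSE
)

# Rename columns
names(raw)[names(raw) == "ttl"] <- "title"

# Convert data types
raw$year  <- as.integer(raw$year)
raw$score <- as.numeric(raw$score)

# Handle errors and outliers: treat scores outside 0 to 100 as missing (assumed rule)
raw$score[!is.na(raw$score) & (raw$score < 0 | raw$score > 100)] <- NA

# Handle missing values: here, simply drop incomplete rows
clean <- na.omit(raw)
```

Dropping rows is only one way to handle missing values; imputing or flagging them may suit other data sets better.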

Page 49: Exploratory Data Analysis with R

Transforming Data

Select columns

Select rows

Group rows

Order rows

Merging data sets
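
The five transformations above can be sketched in base R as follows. The runtimes and genres are taken from the Movies sample in this talk; the second data frame and its audience column are hypothetical, added only to have something to merge.

```r
movies <- data.frame(
  title   = c("Gladiator", "Dinosaur", "The Whole Nine Yards", "Big Momma's House"),
  genre   = c("Action", "Family", "Comedy", "Comedy"),
  runtime = c(155, 82, 98, 99),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table used for the merge step
genres <- data.frame(
  genre    = c("Action", "Comedy", "Family"),
  audience = c("Adult", "Adult", "All ages"),
  stringsAsFactors = FALSE
)

# Select columns
titles_runtime <- movies[, c("title", "runtime")]

# Select rows: movies longer than 90 minutes
long_movies <- movies[movies$runtime > 90, ]

# Group rows: mean runtime per genre
by_genre <- aggregate(runtime ~ genre, data = movies, FUN = mean)

# Order rows: longest first
ordered <- movies[order(-movies$runtime), ]

# Merge data sets on the shared "genre" column
merged <- merge(movies, genres, by = "genre")
```

Packages such as dplyr offer the same operations with a different syntax; base R is used here only to keep the sketch free of dependencies.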

Page 50: Exploratory Data Analysis with R

Exporting Data

File-based data

Web-based data

Databases

Statistical data

CSV XML
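
Exporting to a file-based format can be sketched as below, assuming a small hypothetical data frame and a temporary output path. Only the CSV case is shown.

```r
# Hypothetical cleaned data frame
movies <- data.frame(
  title = c("Gladiator", "Dinosaur"),
  year  = c(2000L, 2000L),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file; row.names = FALSE avoids an extra index column.
out_path <- tempfile(fileext = ".csv")
write.csv(movies, out_path, row.names = FALSE)

# Read the file back to confirm the round trip
check <- read.csv(out_path, stringsAsFactors = FALSE)
```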

Page 51: Exploratory Data Analysis with R

Advice for Working with Data

Often difficult

Time consuming

TIP: Record all steps

Page 52: Exploratory Data Analysis with R

Movies

Title                 Year  Rating  Runtime (minutes)  Genre   Critic Score  Box Office
The Whole Nine Yards  2000  R       98                 Comedy  45%           $57.3M
Cirque du Soleil      2000  G       39                 Family  45%           $13.4M
Gladiator             2000  R       155                Action  76%           $187.3M
Dinosaur              2000  PG      82                 Family  65%           $135.6M
Big Momma's House     2000  PG-13   99                 Comedy  30%           $0.5M

Open Movies Database

Page 55: Exploratory Data Analysis with R

1. Column with wrong name

2. Rows with missing values

3. Runtime column has units

4. Revenue in multiple scales

5. Wrong file format
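
Issues 3 and 4 could be handled along these lines. The exact raw formats (a " min" suffix and a "k" suffix for thousands) are assumptions, since the raw file is not shown in this transcript.

```r
# Hypothetical raw values illustrating the two issues
raw <- data.frame(
  runtime    = c("98 min", "155 min", "82 min"),
  box_office = c("$57.3M", "$187.3M", "$450k"),
  stringsAsFactors = FALSE
)

# Issue 3: strip the unit suffix and convert runtime to a number
raw$runtime <- as.numeric(sub(" min$", "", raw$runtime))

# Issue 4: put all revenue on one scale (millions of dollars)
amount       <- as.numeric(gsub("[$Mk]", "", raw$box_office))
is_thousands <- grepl("k$", raw$box_office)
raw$box_office_millions <- ifelse(is_thousands, amount / 1000, amount)
```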

Page 56: Exploratory Data Analysis with R

Code Demo

Page 59: Exploratory Data Analysis with R

Descriptive Statistics

Describe data

Provides a summary

aka: Summary statistics

Movie Runtime

Statistic     Value (minutes)
Minimum       38
1st Quartile  93
Median        101
Mean          104
3rd Quartile  113
Maximum       219
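
A summary like the one above can be produced with a single call in base R. The vector below is a small hypothetical sample, not the full data set from the talk, so its statistics differ from the table.

```r
# Hypothetical runtimes in minutes
runtime <- c(38, 93, 98, 101, 104, 113, 155, 219)

# summary() reports minimum, quartiles, median, mean, and maximum
summary(runtime)

# The same statistics one at a time
min(runtime)
quantile(runtime, probs = c(0.25, 0.5, 0.75))
mean(runtime)
max(runtime)
```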

Page 64: Exploratory Data Analysis with R

Statistical Terms

Observations

Variables

Categorical variables

Numeric variables

ID  Date        Customer  Product  Quantity
1   2015-08-27  John      Pizza    2
2   2015-08-27  John      Soda     2
3   2015-08-27  Jill      Salad    1
4   2015-08-27  Jill      Milk     1
5   2015-08-28  Miko      Pizza    3
6   2015-08-28  Miko      Soda     2
7   2015-08-28  Sam       Pizza    1
8   2015-08-28  Sam       Milk     1

Page 65: Exploratory Data Analysis with R

Types of Analysis

[Grid: Number of Variables by Type of Variable(s)]

One Categorical Variable

One Numeric Variable

Two Categorical Variables

Two Numeric Variables

Categorical & Numeric Variable

Many Variables

Page 68: Exploratory Data Analysis with R

Analyzing One Categorical Variable

Frequency

Proportion

Movies by Genre

Genre      Frequency  Percentage
Action     612        9%
Adventure  496        7%
Animation  168        2%
Comedy     1281       18%
Drama      1570       22%
Horror     269        4%
…          …          …
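
Frequencies and proportions for one categorical variable can be computed as sketched below. The genre labels are a tiny hypothetical sample, so the counts do not match the table.

```r
# Hypothetical genre labels
genre <- c("Drama", "Comedy", "Drama", "Action", "Comedy", "Drama")

freq <- table(genre)       # frequency of each category
prop <- prop.table(freq)   # proportion of the total

round(100 * prop)          # the same proportions as whole percentages
```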

Page 70: Exploratory Data Analysis with R

Analyzing One Numeric Variable

Central tendency

Dispersion

Shape
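
Each of the three aspects above has common measures in base R. The sketch below uses a hypothetical runtime vector; base R has no built-in skewness function, so shape is estimated by hand using one simple convention among several.

```r
# Hypothetical runtimes in minutes
runtime <- c(38, 93, 98, 101, 104, 113, 155, 219)

# Central tendency
mean(runtime)
median(runtime)

# Dispersion
range(runtime)
IQR(runtime)
var(runtime)
sd(runtime)

# Shape: a simple skewness estimate (positive suggests a longer right tail)
z <- (runtime - mean(runtime)) / sd(runtime)
skewness <- mean(z^3)
skewness
```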

Page 74: Exploratory Data Analysis with R

Analyzing Two Categorical Variables

Joint frequency

Contingency table

Marginal frequency

Movies by Genre and Rating

Genre      G    PG    PG-13  R     Total
Action     2    70    311    229   612
Adventure  44   179   209    64    496
Animation  43   111   8      6     168
Comedy     45   258   472    506   1281
Drama      12   136   586    836   1570
Family     38   181   10     1     230
…          …    …     …      …     …
Total      230  1207  2686   3058  7181

Page 75: Exploratory Data Analysis with R

Analyzing Two Categorical Variables

Joint frequency

Contingency table

Marginal frequency

Relative frequency

Movies by Genre and Rating

Genre      G      PG     PG-13  R      Total
Action     0.001  0.010  0.043  0.032  0.086
Adventure  0.006  0.025  0.029  0.009  0.069
Animation  0.006  0.015  0.001  0.001  0.023
Comedy     0.006  0.036  0.066  0.070  0.178
Drama      0.002  0.019  0.082  0.116  0.219
Family     0.005  0.025  0.001  0.001  0.033
…          …      …      …      …      …
Total      0.032  0.168  0.374  0.426  1.000
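
Joint, marginal, and relative frequencies can all be derived from one contingency table. The sketch below uses a handful of hypothetical genre and rating labels rather than the full data set.

```r
# Hypothetical genre and rating labels
movies <- data.frame(
  genre  = c("Action", "Action", "Comedy", "Comedy", "Comedy", "Drama"),
  rating = c("R", "PG-13", "PG", "R", "R", "R"),
  stringsAsFactors = FALSE
)

# Joint frequencies in a contingency table
counts <- table(movies$genre, movies$rating)

# Marginal frequencies: add row and column totals
with_totals <- addmargins(counts)

# Relative frequencies: each cell divided by the grand total
relative <- prop.table(counts)
```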

Page 77: Exploratory Data Analysis with R

Analyzing Two Numeric Variables

Explanatory vs. outcome

Covariance

Correlation
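
Covariance and correlation can be computed directly in base R. The values below are the runtimes and critic scores of the five movies in the Movies sample from this talk; treating runtime as the explanatory variable and critic score as the outcome is an assumption made for illustration.

```r
runtime      <- c(98, 39, 155, 82, 99)   # explanatory variable (assumed)
critic_score <- c(45, 45, 76, 65, 30)    # outcome variable (assumed), in percent

# Covariance: direction of the linear relationship, in the units of both variables
cov(runtime, critic_score)

# Correlation: the same relationship rescaled to lie between -1 and 1
cor(runtime, critic_score)
```

With only five observations the result says little about movies in general; the point is the mechanics of the two calls.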

Page 79: Exploratory Data Analysis with R

Analyzing a Numeric Variable Grouped by a Categorical Variable

One categorical variable

One numeric variable

Aggregate measures
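
An aggregate measure per category can be sketched with aggregate() or tapply(). The genres and runtimes are those of the five movies in the Movies sample from this talk; the choice of the mean as the aggregate is an assumption.

```r
movies <- data.frame(
  genre   = c("Comedy", "Family", "Action", "Family", "Comedy"),
  runtime = c(98, 39, 155, 82, 99),
  stringsAsFactors = FALSE
)

# Mean runtime for each genre, returned as a data frame
mean_by_genre <- aggregate(runtime ~ genre, data = movies, FUN = mean)

# Full summary statistics for each genre
tapply(movies$runtime, movies$genre, summary)
```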

Page 80: Exploratory Data Analysis with R

Analyzing Many Variables

Page 82: Exploratory Data Analysis with R

Cowboys &

The Musical

Space Invaders:

Extended Edition

Page 83: Exploratory Data Analysis with R

Code Demo

Page 88: Exploratory Data Analysis with R

Data Visualization

Visual data representation

Human pattern recognition

Map dimensions to visual

Page 90: Exploratory Data Analysis with R

Data Visualization

ID  Date        Customer  Product  Quantity
1   2015-08-27  John      Pizza    2
2   2015-08-27  John      Soda     2
3   2015-08-27  Jill      Salad    1
4   2015-08-27  Jill      Milk     1
5   2015-08-28  Miko      Pizza    3
6   2015-08-28  Miko      Soda     2
7   2015-08-28  Sam       Pizza    1
8   2015-08-28  Sam       Milk     1
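
Mapping data dimensions to visual properties can be sketched with base R graphics, using the sales table above. Here the Product column is mapped to position along the x-axis and the total Quantity to bar height. Base graphics is an assumption made to keep the example dependency-free; the talk may use a different plotting package.

```r
# Sales data from the table above
sales <- data.frame(
  Customer = c("John", "John", "Jill", "Jill", "Miko", "Miko", "Sam", "Sam"),
  Product  = c("Pizza", "Soda", "Salad", "Milk", "Pizza", "Soda", "Pizza", "Milk"),
  Quantity = c(2, 2, 1, 1, 3, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Total quantity sold per product
totals <- tapply(sales$Quantity, sales$Product, sum)

# Product maps to x position, total quantity maps to bar height
barplot(
  totals,
  main = "Quantity Sold by Product",
  xlab = "Product",
  ylab = "Quantity"
)
```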

Page 91: Exploratory Data Analysis with R

Types of Analysis

[Grid: Number of Variables by Type of Variable(s)]

One Categorical Variable

One Numeric Variable

Two Categorical Variables

Two Numeric Variables

Categorical & Numeric Variable

Many Variables

Page 93: Exploratory Data Analysis with R

Code Demo

Page 94: Exploratory Data Analysis with R

Feature Length PG

Warlord of the

Rings:

The

An Unexpected Adventure

Page 95: Exploratory Data Analysis with R

Beyond R and EDA

Page 96: Exploratory Data Analysis with R

This is just the tip of the iceberg!

Page 97: Exploratory Data Analysis with R

Advanced Data Analysis with R

Cluster Analysis

Statistical Modeling

Dimensionality Reduction

Analysis of Variance (ANOVA)
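
Each of the four techniques above has an entry point in base R's stats package. The sketch below is only a first taste: it uses the built-in iris data set as a stand-in, and the choice of three clusters and of the model formulas is an assumption made for illustration.

```r
# Built-in iris data set, used only as a stand-in
measurements <- iris[, 1:4]

# Cluster analysis: k-means with three clusters (seed fixed for repeatability)
set.seed(42)
clusters <- kmeans(measurements, centers = 3)

# Statistical modeling: simple linear regression
model <- lm(Petal.Length ~ Petal.Width, data = iris)

# Dimensionality reduction: principal component analysis on scaled variables
pca <- prcomp(measurements, scale. = TRUE)

# Analysis of variance: does mean sepal length differ by species?
anova_fit <- aov(Sepal.Length ~ Species, data = iris)
summary(anova_fit)
```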

Page 98: Exploratory Data Analysis with R

Source: Nathan Yau (www.flowingdata.com)

Page 99: Exploratory Data Analysis with R

Machine Learning with R

Page 101: Exploratory Data Analysis with R

Photos by Radomił Binek, Danielle Langlois, and Frank Mayfield

Page 102: Exploratory Data Analysis with R

Code Demo

Page 104: Exploratory Data Analysis with R

Where to Go Next…

R website: http://www.cran.r-project.org

RStudio: https://www.rstudio.com

Revolutions: http://blog.revolutionanalytics.com

Flowing Data: http://flowingdata.com

R-Bloggers: http://www.r-bloggers.com

Rseek: http://rseek.org

Page 105: Exploratory Data Analysis with R

www.pluralsight.com/authors/matthew-renze

Data Science with R

Exploratory Data Analysis with R

Data Visualization with R (3-part)

Data Science: The Big Picture

Page 106: Exploratory Data Analysis with R

www.matthewrenze.com

Page 107: Exploratory Data Analysis with R

Feedback

Very important to me!

One thing you liked?

One thing I could improve?

Page 109: Exploratory Data Analysis with R

Conclusion

Introduction to R

Working with Data

Descriptive statistics

Data visualization

Beyond R & EDA

Page 110: Exploratory Data Analysis with R

Thank You!

Matthew Renze

Data Science Consultant

Renze Consulting

Twitter: @matthewrenze

Website: www.matthewrenze.com