R training at Aimia


Page 1: R training at Aimia

R INTRODUCTION COURSE

Basics of Data Analysis and Visualisation in R

Ali Arsalan Kazmi

Page 2: R training at Aimia

STRUCTURE FOR THE SESSION

• For Discussion

• For Practical work

Page 3: R training at Aimia

ROADMAP

1. Introduction

2. Fundamentals

3. Data Import and Export in R

4. Data Analysis and Manipulation

5. Data Visualisation

Each section contains:

1. Subsections

2. Some Theory

3. Practical work

Page 4: R training at Aimia

INTRODUCTION

Page 5: R training at Aimia

A BIT ABOUT R

• Initial Thoughts on R

• GNU Free

• Data Analysis and Superior Visualisation

• Burgeoning community of useRs

• #6 in IEEE 2015 Top Programming Languages

• Integration of R in SQL Server 2016

Initial Thoughts on R

• Your first impression of R?

• What do you already know about R?

Page 6: R training at Aimia

A BIT ABOUT R

GNU Free

• Four (essential) freedoms granted

• Share the spirit

Page 7: R training at Aimia

A BIT ABOUT R

Data Analysis and Superior Visualisation

• Clustering – Sophisticated and others

• Supervised Learning

• Deep Learning

• Integration with Hadoop, Spark, Storm

• Many more

Page 8: R training at Aimia

A BIT ABOUT R

Page 9: R training at Aimia

A BIT ABOUT R

Page 10: R training at Aimia

A BIT ABOUT R

Page 11: R training at Aimia

A BIT ABOUT R

Page 12: R training at Aimia

A BIT ABOUT R

Burgeoning community of useRs

• Currently: 7,284 packages

• Strong presence on the web

• R Consortium

• Google, eBay, Facebook, NYT, etc.

Page 13: R training at Aimia

A BIT ABOUT R

#6 in IEEE 2015 Top Programming Languages

• Link: http://spectrum.ieee.org/computing/software/the-2015-top-ten-programming-languages

• Ranked alongside general-purpose languages

Page 14: R training at Aimia

A BIT ABOUT R

Integration of R in SQL Server 2016

• Link: http://blog.revolutionanalytics.com/2015/05/r-in-sql-server.html

Page 15: R training at Aimia

FUNDAMENTALS

Page 16: R training at Aimia

FUNDAMENTALS

• Data Types

• Data Structures

• Control Structures

• Functions

Data Types

• Think of the data types commonly used in statistics

• In R: Numeric/Double; Integer; Logical; Character; Factor

• Many more
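A minimal sketch of the types listed above. The variable names and values are made up for illustration and are not part of the course material. `typeof()` reports how R stores a value, `class()` how R treats it.

```r
# Illustrative values; the names are hypothetical.
height    <- 1.82                                # numeric, stored as double
visits    <- 3L                                  # integer (note the L suffix)
is_member <- TRUE                                # logical
city      <- "London"                            # character
tier      <- factor(c("gold", "silver", "gold")) # factor (categorical)

typeof(height)     # "double"
typeof(visits)     # "integer"
typeof(is_member)  # "logical"
typeof(city)       # "character"
class(tier)        # "factor"
typeof(tier)       # "integer": a factor is integer codes plus labels
levels(tier)       # "gold" "silver"
```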

Page 17: R training at Aimia

FUNDAMENTALS

Data Structures

• How to store data? Logico-Computational considerations…

• In R: Atomic vectors; Lists; Matrices and Arrays; Data frames
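The structures named above can be sketched in a few lines of base R. All object names and values here are illustrative assumptions.

```r
# Atomic vector: every element has the same type
ages <- c(34, 27, 45)

# List: elements may differ in type and length
customer <- list(name = "A. Smith", ages = ages, active = TRUE)

# Matrix: a two-dimensional atomic vector, filled column by column
m <- matrix(1:6, nrow = 2)

# Array: the same idea with more than two dimensions
a <- array(1:24, dim = c(2, 3, 4))

# Data frame: a list of equal-length columns, one type per column
df <- data.frame(id = 1:3, age = ages, member = c(TRUE, FALSE, TRUE))

str(df)  # compact summary of the structure
dim(m)   # 2 3
```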

Page 18: R training at Aimia

FUNDAMENTALS

Control Structures

• Control the flow of a programme’s/function’s logic

• if; if/else; for; while; repeat
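A short sketch of each control structure in base R. The values are arbitrary. The vectorised `ifelse()` is included as an assumption about what the slide may also have meant.

```r
x <- 7

# if / else: branch on a single TRUE/FALSE condition
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# ifelse(): vectorised counterpart, evaluated element by element
labels <- ifelse(c(1, 2, 3) > 1, "big", "small")

# for: iterate over the elements of a vector
total <- 0
for (i in 1:5) {
  total <- total + i
}

# while: loop for as long as the condition holds
n <- 0
while (2^n < 100) {
  n <- n + 1
}

# repeat: loop until an explicit break
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) break
}
```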

Page 19: R training at Aimia

FUNDAMENTALS

Functions

• “Everything that happens is a function call” – John Chambers

• “Everything that exists is an object” – John Chambers

• Modularise; Customise; Optimise; Automate

• Transition from a useR to a programmeR (and on to a developeR)
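Both slogans can be seen directly at the console. The sketch below uses a hypothetical helper, `standardise()`, written only for this illustration.

```r
# Operators are functions: these two calls are equivalent
1 + 2
`+`(1, 2)

# A small reusable function; the name and default argument are illustrative
standardise <- function(x, na.rm = TRUE) {
  (x - mean(x, na.rm = na.rm)) / sd(x, na.rm = na.rm)
}

scores <- c(10, 20, 30)
z <- standardise(scores)
z  # -1 0 1

# Functions are objects too, so they can be passed to other functions
sapply(list(a = 1:3, b = 4:6), mean)
```

Wrapping a repeated calculation in a function like this is the first step from interactive use towards modular, automated code.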

Page 20: R training at Aimia

PRACTICAL SESSION

Page 21: R training at Aimia

DATA I/O

Page 22: R training at Aimia

DATA I/O

• Sources for Data

• Types of Data

• Base R for I/O

• Packages for Data Import

Sources for Data

• Online Sources: Web; APIs; Dropbox; GitHub

• Offline Sources: Databases; flat files; zipped files

Page 23: R training at Aimia

DATA I/O

Types of Data

• .txt; .csv; .xlsx; .RData

• .html; .json; .xml

• .xpt (SAS); .sav (SPSS); .dta (Stata)

Page 24: R training at Aimia

DATA I/O

Base R for I/O

• Base R can read a variety of data formats

• Base R import can be slow with large data

• For exotic file types, use dedicated packages
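A minimal round trip with base R, as a sketch. The data frame and the temporary file paths are made up for illustration.

```r
# A small data frame to export; the contents are invented
sales <- data.frame(region = c("North", "South"), units = c(120L, 95L))

# Write a .csv file to a temporary location, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(sales, csv_path, row.names = FALSE)
sales_back <- read.csv(csv_path, stringsAsFactors = FALSE)

# R's native .RData format stores objects by name
rdata_path <- tempfile(fileext = ".RData")
save(sales, file = rdata_path)

# Load into a separate environment to avoid overwriting the original
restored <- new.env()
load(rdata_path, envir = restored)
identical(restored$sales, sales)
```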

Page 25: R training at Aimia

DATA I/O

Packages for Data Import

• readr – fast import for .txt, .csv

• readxl – fast import for .xlsx

• R Commander (Rcmdr package) for GUI-based import
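A brief sketch of importing with readr, assuming the package is installed. The file and its contents are created here purely for illustration.

```r
library(readr)  # assumption: readr is installed

# Create a small .csv file in a temporary location
path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, spend = c(10.5, 20, 7.25)), path)

# read_csv() returns a tibble and guesses the column types
spend <- read_csv(path, show_col_types = FALSE)
spend

# readxl works along the same lines for Excel files, for example
# readxl::read_excel("some_file.xlsx")   # hypothetical file name
```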

Page 26: R training at Aimia

PRACTICAL SESSION

Page 27: R training at Aimia

DATA MANIPULATION & ANALYSIS

Page 28: R training at Aimia

DATA MANIPULATION & ANALYSIS

• Subsetting Data

• Split-Apply-Combine

• Merging Data

• sqldf

• dplyr

• R Commander

Subsetting Data

• Subsetting ≡ SELECT & WHERE in SQL

• Subset operators: [, [[, $

• Numeric or logical indexes are used to subset data
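The three operators can be sketched on a small data frame. The data are invented for illustration.

```r
df <- data.frame(
  id    = 1:5,
  spend = c(50, 200, 75, 310, 20),
  tier  = c("silver", "gold", "silver", "gold", "bronze")
)

# Numeric indexes: rows 1 to 2, columns 1 and 2
df[1:2, c(1, 2)]

# Logical index on rows (WHERE) plus column names (SELECT), roughly:
# SELECT id, spend FROM df WHERE spend > 100
big <- df[df$spend > 100, c("id", "spend")]

# [[ and $ extract a single column as a vector
df[["spend"]]
df$tier

# [ with a single index keeps the data frame structure
class(df["spend"])    # "data.frame"
class(df[["spend"]])  # "numeric"
```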

Page 29: R training at Aimia

DATA MANIPULATION & ANALYSIS

Split-Apply-Combine

• Split a collection of data, Apply a function to each partition, Combine the results and present them

• Collection ≡ data structure

• Splitting is different for data structures and data types

• Combination is different for data structures
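The three steps can be written out one at a time in base R. This is a sketch on invented data, not the course's own exercise.

```r
df <- data.frame(
  tier  = c("gold", "silver", "gold", "silver", "bronze"),
  spend = c(200, 50, 310, 75, 20)
)

# Split: one numeric vector per tier
pieces <- split(df$spend, df$tier)

# Apply: the same function on every piece
means <- lapply(pieces, mean)

# Combine: collapse the list into a named vector
result <- unlist(means)
result

# The three steps in a single call
tapply(df$spend, df$tier, mean)                 # returns a named array
aggregate(spend ~ tier, data = df, FUN = mean)  # returns a data frame
```

The choice between `tapply()` and `aggregate()` illustrates the last bullet: the same computation is combined into a different data structure.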

Page 30: R training at Aimia

DATA MANIPULATION & ANALYSIS

Merging Data

• Merge ≡ JOINs in SQL

• Specific to data frames
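A sketch of how base R's `merge()` lines up with the SQL join types, using two invented data frames.

```r
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(1, 1, 3, 4), amount = c(20, 35, 50, 15))

inner <- merge(customers, orders, by = "id")                # INNER JOIN
left  <- merge(customers, orders, by = "id", all.x = TRUE)  # LEFT JOIN
right <- merge(customers, orders, by = "id", all.y = TRUE)  # RIGHT JOIN
full  <- merge(customers, orders, by = "id", all = TRUE)    # FULL OUTER JOIN

nrow(inner)  # 3
nrow(left)   # 4: Bob has no orders, so his amount is NA
```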

Page 31: R training at Aimia

DATA MANIPULATION & ANALYSIS

sqldf

• Link: https://cran.r-project.org/web/packages/sqldf/sqldf.pdf

• Write SQL in R

• Specific to data frames

• Limited to data analysis and manipulation operations
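A minimal sketch of writing SQL against a data frame, assuming the sqldf package is installed. The data frame is invented for illustration.

```r
library(sqldf)  # assumption: sqldf is installed

customers <- data.frame(
  id    = 1:4,
  tier  = c("gold", "silver", "gold", "bronze"),
  spend = c(200, 50, 310, 20)
)

# Data frames in the workspace are referred to by name, as if they were tables
by_tier <- sqldf("
  SELECT tier, COUNT(*) AS n, AVG(spend) AS avg_spend
  FROM customers
  GROUP BY tier
  ORDER BY tier
")
by_tier
```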

Page 32: R training at Aimia

DATA MANIPULATION & ANALYSIS

dplyr

• Link: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf

• Expressive for most data manipulation

• Very efficient

• Consistent coding style

• Can connect directly to some RDBMSs
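A short dplyr pipeline as a sketch, assuming the package is installed. The data are invented, and the comments relate each verb to the ideas covered earlier in this section.

```r
library(dplyr)  # assumption: dplyr is installed

customers <- data.frame(
  id    = 1:4,
  tier  = c("gold", "silver", "gold", "bronze"),
  spend = c(200, 50, 310, 20)
)

summary_tbl <- customers %>%
  filter(spend > 30) %>%                              # keep rows (WHERE)
  group_by(tier) %>%                                  # split
  summarise(n = n(), avg_spend = mean(spend)) %>%     # apply and combine
  arrange(desc(avg_spend))                            # sort, largest first

summary_tbl
```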

Page 33: R training at Aimia

DATA MANIPULATION & ANALYSIS

R Commander

• GUI

• Can assist in learning R

Page 34: R training at Aimia

PRACTICAL SESSION

Page 35: R training at Aimia

VISUALISATION

Page 36: R training at Aimia

VISUALISATION

• Grammar of Graphics

• ggplot2

• Bonus: Interactive Visualisation

Grammar of Graphics

• Graph is formed of well-defined constituents

• Grammar enables succinct definition of constituents

• Layer(s)

• Scale(s)

• Coordinate System

• Facetting/Trellis Graphics

Page 37: R training at Aimia

VISUALISATION

Grammar of Graphics

• Layer(s)

    • Data

    • Aesthetics (positions on x/y axes; colours, size, etc.)

    • Statistical Transformation (none; Log; Squared; etc.)

    • Geometric Object(s)

    • Position Adjustment

• Scale(s) – control how data are mapped to each aesthetic

• Coordinate System

• Facetting/Trellis Graphics

Page 38: R training at Aimia

VISUALISATION

Grammar of Graphics

• Graph is formed of well-defined constituents

• Grammar enables succinct definition of constituents

• Insights into graphs’ structure

• Encourages Creativity

Page 39: R training at Aimia

VISUALISATION

ggplot2

• Link: http://docs.ggplot2.org/current/

• An implementation of (layered) Grammar of Graphics

• Elegant graphics

• Typical statistical graphs plus more exotic ones

• Works with data frames

• Static graphics
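A sketch of how the grammar's constituents appear in ggplot2 code, assuming the package is installed. It uses the built-in `mtcars` dataset. The choice of variables, palette and facets is arbitrary.

```r
library(ggplot2)  # assumption: ggplot2 is installed

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                   # data and aesthetics
  geom_point(aes(colour = factor(cyl)), position = "jitter") + # layer: geom + position adjustment
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +    # layer: statistical transformation
  scale_colour_brewer(palette = "Set1", name = "Cylinders") +  # scale: maps cyl values to colours
  coord_cartesian() +                                          # coordinate system
  facet_wrap(~ gear)                                           # facets: one panel per gear value

# print(p) draws the plot; ggsave() writes it to a file
```

Each `+` adds one constituent, so a plot can be changed piece by piece, for example by swapping `facet_wrap()` for another faceting function without touching the layers.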

Page 40: R training at Aimia

VISUALISATION

Bonus: Interactive Visualisation

• Intended for the Web – HTML files

• Mostly based on D3 – Data-Driven Documents

• Based on contributed packages

• Some under active development

• Not limited to data frame datasets

Page 41: R training at Aimia

PRACTICAL SESSION