Slides from R crash course by Ilmo van der Löwe

Post on 19-Jun-2015

Slides for an introductory R class at the University of Cambridge

Transcript of Slides from R crash course by Ilmo van der Löwe

CAMBRIDGE PROSOCIALITY AND WELL-BEING LABORATORY

CRASH COURSE

DATA SCIENTIST

The Sexiest Job of the 21st Century

Statistics

Domain expertise

Hacking

BIG DATA

ish

SOCIAL NETWORK DATA

DIGITAL TRACE DATA

GLOBAL SURVEY DATA

GENETIC DATA

SPSS ain’t gonna cut it.

Windows Mac Linux

Built by scientists for scientists.

“We have named our language R – in part to acknowledge the influence of S and in part to celebrate our own efforts.”

Ross Ihaka

PROFESSOR OF STATISTICS

University of Auckland

Robert Gentleman

SENIOR DIRECTOR OF BIOINFORMATICS

Genentech

R is the most powerful statistics language

in the world.

• Open source
- Free as in speech and beer

• Cross-platform
- Runs on Windows, Mac, and Linux

• Versatile and extensible
- Over 4,000 user-contributed packages

• General-purpose programming language
- You can make it do things automagically

http://r-project.org

RStudio.org

Why use R?

R is used by the best.

"...a way to organize the brainpower of the world’s most talented data scientists..."

Hal Varian

CHIEF ECONOMIST

software on

50% of winners use R

• Everything in one system
- base: linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering etc.
- packages from multilevel modeling to medical image analysis

• Custom functionality
- Programming ➞ Automation

4,403 available packages

• Automate away “click-click-click” tasks
- More efficient work

• Share analyses and data with ease
- Better collaboration

• Make results reproducible
- Better science

How do I use R?

You use R by typing commands, not with a mouse.

R version 2.14.1 (2011-12-22)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> Prompt

How do you know what to type to R?

For beginners:

For the statistically minded:

For programmers:

The very basics

Put “this” in “here”

“this” → HERE

Put "this" in here

here <- "this"

here: variable
"this": a string
<-: assignment operator

here = "this"

=: assignment operator

> here
[1] "this"
>

[1] is the row #: the position of the first value printed on that line.
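The assignment slides above can be tried directly at the prompt. This is a minimal sketch; the variable `there` and the replacement string are made up for illustration and do not come from the slides.

```r
# Put the string "this" into the variable `here`.
here <- "this"

# `=` also assigns at the top level; `<-` is the more common R convention.
there = "that"

# Typing a variable's name prints its value; print() does the same explicitly.
here
print(there)

# A variable can be overwritten at any time.
here <- "something else"
here
```

Note that R code needs straight quotes ("this"); curly quotes pasted from a word processor or slide will cause an error.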

functions and data

INPUT → BLACK BOX → OUTPUT

INPUT → FUNCTION → OUTPUT

FUNCTIONS ARE LIKE FACTORIES.

INPUT → ( ) → OUTPUT

In R, parentheses mean: “DO SOMETHING”

(according to my instructions)

x.bar <- mean(x)

> mean(x
+

The + prompt: R waits for more (here, the closing parenthesis).

INPUT → ( )

OUTPUT is captured into VARIABLES.
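The factory picture can be sketched in a few lines. The numbers in `x` are made up for illustration, and `x.rounded` is a hypothetical name; only `x.bar <- mean(x)` comes from the slides.

```r
# Hypothetical input: five made-up measurements.
x <- c(2, 4, 6, 8, 10)

# mean() is the factory: x goes in, the average comes out
# and is captured in the variable x.bar.
x.bar <- mean(x)
x.bar

# Functions can take more than one input (argument).
x.rounded <- round(x.bar / 3, digits = 2)
x.rounded
```

If the output is not captured with `<-`, R simply prints it and then forgets it.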

In R, things are often stored in vectors, lists, matrices, or data frames.

Vector

• The workhorse of R

- Even individual numbers are a special case of vectors (i.e., a vector of one)

• All elements have to be of the same mode

- Vectors of numbers are ok

‣ c(0,1,2,3,4,5,6,7,8,9)

- So are vectors of character strings

‣ c("Ilmo","Alex","Chris")

us <- c("Ilmo","Alex","Chris")

us[1]
us[2:3]
length(us)
class(us)

Very classy characters, indeed!
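The vector calls above, annotated with what each one returns. The last two examples (`n` and `mixed`) are additions for illustration; they show the "vector of one" and the "same mode" rules from the Vector slide.

```r
us <- c("Ilmo", "Alex", "Chris")

us[1]        # first element
us[2:3]      # second and third elements
length(us)   # number of elements
class(us)    # "character"

# A single number is a vector of length one.
n <- 42
length(n)

# Mixing modes does not fail; R silently converts everything to one mode.
mixed <- c(1, "two", 3)
class(mixed)
```

Because of that silent conversion, a stray text value in a column of numbers turns the whole vector into character strings.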

List

• Mix and match!

- Lists can store things of different modes

- Numeric, character, data frames...

• Many functions return a list for later use

me <- list(name = "Ilmo", legs = 2)

me$name
me$legs
me["name"]
me[["name"]]
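The difference between the single and double brackets is easy to miss, so here is a short sketch using the same list as the slide. The `class()` calls are additions for illustration.

```r
me <- list(name = "Ilmo", legs = 2)

me$name         # the element itself
me[["name"]]    # same as me$name
me["name"]      # a list of length one that still wraps the element

class(me[["name"]])  # "character"
class(me["name"])    # "list"

# Different modes live side by side in one list.
class(me$legs)       # "numeric"
```

A rule of thumb: `[[ ]]` and `$` take the item out of the box, while `[ ]` hands over a smaller box.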

Matrices are two-dimensional vectors

A character string matrix

     [,1]    [,2]
[1,] "Ilmo"  "Alex"
[2,] "Chris" "Dacher"

A numeric matrix

     [,1] [,2]
[1,] 1.09 4.20
[2,] 2.86 2.92

ucb <- rbind( c("Ilmo","Alex"), c("Chris","Dacher") )

ucb[1,1]
ucb[,1]
ucb[2,2]
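A sketch of matrix indexing with the slide's own values. Building the numeric matrix with `matrix()` is an assumption about how it was made; the slides only show its printed form.

```r
# rbind() stacks vectors as rows.
ucb <- rbind(c("Ilmo", "Alex"), c("Chris", "Dacher"))

ucb[1, 1]   # row 1, column 1
ucb[, 1]    # all of column 1
ucb[2, 2]   # row 2, column 2
dim(ucb)    # number of rows and columns

# matrix() fills column by column unless told otherwise.
m <- matrix(c(1.09, 2.86, 4.20, 2.92), nrow = 2)
m[1, ]      # all of row 1
```

The pattern is always `[row, column]`; leaving one side empty means "all of them".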

Data Frames

• The best of both lists and matrices

- Columns and rows

‣ Each column contains data of a single mode

‣ Each row can contain data of various modes

• Usually created by reading data from a file or database

DATA FRAMES ARE LIKE WAREHOUSES.

A data frame d with columns age, gender, height, weight and rows 1, 2, 3:

  age gender height weight
1
2
3

d[,]                      every row and every column
d[1,]                     row 1
d[,1]                     column 1
d[,"age"]                 the column named age
d$age                     the column named age
d[,1:3]                   columns 1 to 3
d[2,2]                    row 2, column 2
d[2,c("age","weight")]    row 2, columns age and weight
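The warehouse lookups can be tried on a small data frame. The values below are invented for illustration; only the column names and the indexing expressions come from the slides, and the final condition-based lookup is an addition.

```r
# Made-up values, for illustration only.
d <- data.frame(
  age    = c(25, 31, 47),
  gender = c("f", "m", "f"),
  height = c(165, 180, 172),
  weight = c(60, 82, 70)
)

d[1, ]                      # first row
d[, "age"]                  # the age column as a vector
d$age                       # same thing
d[, 1:3]                    # first three columns
d[2, 2]                     # row 2, column 2
d[2, c("age", "weight")]    # row 2, two named columns

# Rows can also be picked by a condition.
d[d$age > 30, ]
```

Referring to columns by name rather than by number tends to survive better when the data file later gains or loses a column.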

d <- read.csv("MyNobelPrizeData.csv")
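Since MyNobelPrizeData.csv is not available here, the following sketch writes a tiny made-up table to a temporary file and reads it back, which shows what `read.csv()` returns. The column names `id` and `score` are hypothetical.

```r
# Write a small made-up table to a temporary CSV file...
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, score = c(0.5, 0.7, 0.9)),
          path, row.names = FALSE)

# ...and read it back in as a data frame.
d <- read.csv(path)

str(d)    # structure: column names and modes
head(d)   # first few rows
```

Running `str()` right after reading a file is a quick way to check that numbers were read as numbers.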

What will this do?

d <- read.spss("thatExperiment.sav")
Error: could not find function "read.spss"

library("foreign")

Minitab, S, SAS, Stata, Systat, and dBase

...but no Excel

install.packages("xlsx")

read.xlsx("recipes.xlsx")
Error in read.xls("recipes.xlsx"): Please provide a sheet name OR a sheet index.

WTF is a “sheet index”?
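A sheet index is the position of a worksheet inside the workbook: 1 for the first sheet, 2 for the second, and so on. The sketch below assumes the xlsx package (and the Java it depends on) is installed, and it creates its own small workbook because recipes.xlsx is not available; the data, the sheet name "Sheet1", and the variable names are made up.

```r
# Runs only if the xlsx package can be loaded.
if (requireNamespace("xlsx", quietly = TRUE)) {
  # Create a tiny made-up workbook with one sheet.
  path <- tempfile(fileext = ".xlsx")
  recipes <- data.frame(dish = c("soup", "cake"), minutes = c(30, 60))
  xlsx::write.xlsx(recipes, path, sheetName = "Sheet1", row.names = FALSE)

  # Pick the sheet by position...
  by.index <- xlsx::read.xlsx(path, sheetIndex = 1)

  # ...or by name.
  by.name <- xlsx::read.xlsx(path, sheetName = "Sheet1")
}
```

Either argument satisfies the error message; the index is handy when the sheet names are unknown.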

Two-step guide to solving R problems

Step 1: Search

help(read.xlsx)
or
?read.xlsx

R has a lovely built-in documentation system. Most often, all that you need is right there.

help.search("bar plot")
or
??"bar plot"

When you don’t know exactly what you are looking for, use free-text search.

Google it.

You are probably not the first person to encounter the error. Paste the error message into Google and see what pops up.

rseek.org
stackexchange.com
reddit.com/r/rstats

Read the R expert forums. See if they have already solved the problem.

Step 2: Ask

Make a reproducible example.

Pin down the exact problem in as few lines of code as possible. Simplify until only the problem remains.

Ask your friends.

Solving problems together is a great way to learn.

Ask the experts online.

There’s the R mailing list, statsexchange, rstats reddit, Quora, Twitter, etc. You probably found these already with your Google searches.

Ask the stats dept experts.

They do this for a living.

Ask Alex or me.

...and show us what you have tried already.
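One possible shape for the reproducible example from Step 2, sketched with R's built-in anscombe data so that anyone can run it. The `cor()` call is only a stand-in for whatever line is actually causing trouble, and the names `small` and `result` are hypothetical.

```r
# Use data everyone already has, cut down to a few rows...
small <- head(anscombe[, c("x1", "y1")], 3)

# ...or make your own data pasteable: dput() prints code
# that recreates the object exactly.
dput(small)

# The one line that shows the problem (here just a stand-in call).
result <- cor(small$x1, small$y1)
result

# Version details that helpers often ask for.
sessionInfo()
```

Pasting the `dput()` output, the problem line, and the exact error message usually gives helpers everything they need.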

Let’s dive in!

Who has any programming experience?

Get your group on.

OPTIONAL

Source

Console

Workspace

Frank Anscombe

STATISTICIAN

   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

a <- anscombe

a

summary(a$x1)
summary(a[,1])
summary(a[,"x1"])

They all mean the same.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    4.0     6.5     9.0     9.0    11.5    14.0
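That the three forms mean the same can be checked rather than taken on trust. A minimal sketch; the variable names `by.dollar`, `by.number`, and `by.name` are made up for illustration.

```r
a <- anscombe

by.dollar <- summary(a$x1)        # column picked with $
by.number <- summary(a[, 1])      # column picked by position
by.name   <- summary(a[, "x1"])   # column picked by name

# identical() returns TRUE only if two objects match exactly.
identical(by.dollar, by.number)
identical(by.dollar, by.name)
```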

What about the rest of a?

summary(a)

plot(a)

plot(a$x1, a$y1)

cor(a$x1, a$y1)
cor.test(a$x1, a$y1)
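The same correlation can be computed for all four x/y pairs at once, which is the point of Anscombe's data: the numbers look alike while the plots do not. This is a sketch; the loop over `1:4`, the pair labels, and the 2 x 2 plot layout are additions, not part of the slides.

```r
a <- anscombe

# Correlation of x1 with y1, x2 with y2, and so on.
r <- sapply(1:4, function(i) {
  cor(a[[paste0("x", i)]], a[[paste0("y", i)]])
})
names(r) <- paste0("pair", 1:4)
round(r, 3)

# Draw the four scatterplots in a 2 x 2 grid, then restore the layout.
old.par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(a[[paste0("x", i)]], a[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old.par)
```

All four correlations come out at roughly 0.816, so only the plots reveal how different the four data sets are.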

a$x4 <- NULL
a$y4 <- NULL

or

a[,c("x4","y4")] <- NULL

or

a[,c(4,8)] <- NULL
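The column-removal forms above are alternatives: each one should start from the full eight-column data, since a column that is already gone cannot be removed again. The sketch below works on copies (`b`, `by.name`, `kept` are hypothetical names) so the original `a` stays intact; the `setdiff()` variant is an addition.

```r
a <- anscombe

# One column at a time.
b <- a
b$x4 <- NULL
b$y4 <- NULL

# Both columns at once, by name.
by.name <- a
by.name[, c("x4", "y4")] <- NULL

# Without deleting anything: keep every column except the two.
kept <- a[, setdiff(names(a), c("x4", "y4"))]

names(b)
ncol(a)   # the original still has all 8 columns
```

Removing by name is safer than removing by position such as `c(4,8)`, because positions shift as soon as one column is deleted.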

NULL
TRUE
FALSE
NA
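A short sketch of how these four special values behave. The vector `v` and the names `big` and `n.big` are made up for illustration.

```r
# NULL means "nothing here at all": it has length zero.
is.null(NULL)
length(NULL)

# NA means "a value exists but is missing": it still takes up a slot.
v <- c(4, NA, 6)
length(v)
is.na(v)

# Missing values spread through calculations unless removed.
mean(v)                  # NA
mean(v, na.rm = TRUE)    # 5

# Comparisons give TRUE, FALSE, or NA when the input is missing.
big <- v > 4
big

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs.
n.big <- sum(big, na.rm = TRUE)
n.big
```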