R meetup talk

Post on 17-Jul-2015

Fast lookups in R

Joseph Adler

April 13 2010

About me

Relevant work

• Tasks

– Computer security research

– Credit risk modeling

– Pricing strategy

– Direct marketing

• Places

– American Express

– Johnson and Johnson

– DoubleClick

– VeriSign

– LinkedIn (now)

About me

Books

Today’s talk

What I wrote

If you need to store a big lookup table, consider implementing the table using an environment. Environment objects are implemented using hash tables. Vectors and lists are not. This means that looking up a value in a list with n elements can take O(n) time. Looking up a value in an environment object takes O(1) time on average.

Today’s talk

What I read after the book was printed

Re: [R] beginner Q: hashtable or dictionary?

From: Prof Brian Ripley <ripley_at_stats.ox.ac.uk>

Date: Mon 30 Jan 2006 - 18:37:00 EST

On Sun, 29 Jan 2006, hadley wickham wrote:

>> use a 'list':
>
> Is a list O(1) for setting and getting?

Can you elaborate? R is a vector language, and normally you create a list in one pass, and you can retrieve multiple elements at once.

Retrieving elements by name from a long vector (including a list) is very fast, as an internal hash table is used. Does the following item from ONEWS answer your question?

    Indexing a vector by a character vector was slow if both the vector and index were long (say 10,000). Now hashing is used and the time should be linear in the longer of the lengths (but more memory is used).

Indexing by number is O(1) except where replacement causes the list vector to be copied. There is always the option to use match() to convert to numeric indexing.

-- Brian D. Ripley,

Professor of Applied Statistics,

University of Oxford

Retrieving elements by name from a

long vector (including a list) is very

fast, as an internal hash table is used.

Professor Brian D. Ripley
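
Ripley's closing suggestion, converting names to numeric positions with match() and then indexing by number, might look like the sketch below. The vector `prices` and its contents are made up for illustration.

```r
# Hypothetical lookup table; the names and values are illustrative only.
prices <- c(apple = 1.50, pear = 2.25, plum = 0.75)

# Names we want to look up, possibly many at once.
wanted <- c("plum", "apple")

# match() returns the position of each wanted name in names(prices)
# (NA for names that are not present).
idx <- match(wanted, names(prices))

# Numeric indexing then retrieves all requested values in one step.
found <- prices[idx]
print(found)
```

Because match() is vectorized, the name-to-position conversion happens once for the whole batch of names rather than once per lookup.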

Today’s talk

• A short introduction to objects in R

• Looking up values in R

– How lookup tables are implemented in R

– Measuring lookup speed

– Optimizing lookup speed

Objects in R

Everything in R is an object. Here are some

examples of objects.

Numeric Vector:

> onehalf <- 1/2
> class(onehalf)
[1] "numeric"

Objects in R

Integer Vector:

> four <- as.integer(4)
> four
[1] 4
> class(four)
[1] "integer"

Objects in R

Character vector:

> zero <- "zero"
> class(zero)
[1] "character"

Objects in R

Logical vector:

> this.is.interesting <- FALSE
> class(this.is.interesting)
[1] "logical"

Objects in R

Vectors can have multiple elements:

> one.to.five <- 1:5
> class(one.to.five)
[1] "integer"
> six.to.ten <- c(6, 7, 8, 9, 10)
> class(six.to.ten)
[1] "numeric"

Objects in R

Lists contain heterogeneous collections of objects:

> stuff <- list(3.14, "hat", FALSE)
> class(stuff)
[1] "list"

Objects in R

Functions are also objects in R:

> f <- function(x, y) {
+   x + y
+ }
> f
function(x, y) {
  x + y
}
> class(f)
[1] "function"

Objects in R

Environments map names to objects. They are used within R itself to map variable names to objects. You can access these environment objects, or create your own.

> one <- 1
> two <- 2
> three <- 3
> objects()
[1] "one"   "three" "two"
> e <- .GlobalEnv
> class(e)
[1] "environment"
> objects(e)
[1] "e"     "one"   "three" "two"
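
Creating your own environment, rather than inspecting the global one, could look like the minimal sketch below. The environment name `ages` and the values stored in it are invented for illustration.

```r
# Create a new, empty environment backed by a hash table.
ages <- new.env(hash = TRUE)

# Bind names to values inside that environment (illustrative data).
assign("Joe", 34, envir = ages)
assign("Bob", 29, envir = ages)

# List the names it holds, then fetch one value by name.
stored <- sort(ls(ages))
bob <- get("Bob", envir = ages)
print(stored)
print(bob)
```

Unlike vectors and lists, environments are not copied when they are modified, so the bindings added by assign() are visible through every reference to `ages`.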

Lookups

You can look up an item in a vector, list, or array

within R

– Let's define a vector:

> a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> a
 [1]  1  2  3  4  5  6  7  8  9 10

– You can refer to elements by index:

> a[3]
[1] 3

Lookups

It's also possible to name elements in a vector, then refer to them by name:

> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2

This can be very convenient: you can use every vector in R as a table. You can access the name vector through the names function:

> names(b)
[1] "Joe" "Bob" "Jim"

Lookups

Named vectors in R are implemented using two different arrays:

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Lookups

The name lookup algorithm works roughly like this:

lookup <- function(vector, name) {
  # compare the requested name with each stored name in turn
  for (i in seq_along(vector)) {
    if (names(vector)[i] == name)
      return(vector[i])
  }
  # no name matched
  NA
}

Lookups

Example: Look up a.20["F"]

names(a.20):  B  C  D  E  F  G  H  I  J BA BB BC BD BE BF BG BH BI BJ CA
a.20:         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

The lookup compares "F" with names(a.20)[1], then names(a.20)[2], and so on, until it finds a match at names(a.20)[5].

Lookups

In vectors,

– Looking up a value by index takes a constant amount

of time.

– Looking up a value by name (potentially) requires

looking at every name in the names array. (This

means that lookup times scale linearly with the

number of items in the table.)

Lookups

Environments store (and fetch) data using a

different structure. They use hash tables.

Hash tables rely on a hash function to map labels

to indices.

Lookups

Simple hash table implementation

Example: store 15 ¾ for "Joe"

1. Calculate h("Joe"); here, h("Joe") = 4

2. Store 15 ¾ in the table in slot h("Joe")

Slot  Value
1
2
3
4     15 ¾
5
6

Lookups

If you carefully choose the size of the hash table and the hash function, you can store and look up values in hash tables in constant time (on average).
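
The store-and-fetch steps can be sketched as a toy hash table. This is only an illustration of the idea, not how R implements environments internally: the names `make_table`, `put`, and `fetch`, and the character-code hash function, are assumptions made for the example. Collisions are handled by keeping a small named list in each slot.

```r
# Toy hash table with a fixed number of slots (illustrative only).
make_table <- function(size = 6) {
  buckets <- vector("list", size)   # one bucket per slot, initially NULL

  # Hypothetical hash function: sum of character codes, mapped to 1..size.
  h <- function(key) sum(utf8ToInt(key)) %% size + 1

  # Store a value: compute the slot, then add the key to that slot's bucket.
  put <- function(key, value) {
    slot <- h(key)
    bucket <- buckets[[slot]]
    if (is.null(bucket)) bucket <- list()
    bucket[[key]] <- value
    buckets[[slot]] <<- bucket
    invisible(slot)
  }

  # Fetch a value: only the one bucket at h(key) is searched.
  # Returns NULL when the key is absent.
  fetch <- function(key) buckets[[h(key)]][[key]]

  list(h = h, put = put, fetch = fetch)
}

tbl <- make_table()
tbl$put("Joe", 15.75)
print(tbl$fetch("Joe"))
```

With a well-chosen table size, each bucket stays short, so a fetch examines only a handful of keys regardless of how many values the table holds.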

Measuring Lookup Speed

In theory, looking up values in environments

should be faster than looking up values in vectors.

In practice, how much difference does this make?

Let’s measure how much time it takes to look up

values in vectors and environments, using different

lookup methods

Measuring Lookup Speed

Let's build a large, labeled vector for testing:

labeled.array <- function(n) {
  a <- 1:n
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    names(a)[i] <- chartr(from, to, i)
  }
  a
}

Here's an example of the output of this function:

> a.20 <- labeled.array(20)
> a.20
 A  B  C  D  E  F  G  H  I AJ AA AB AC AD AE AF AG AH AI BJ
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

Measuring Lookup Speed

Let's also create environment objects for testing:

labeled.environment <- function(n) {
  e <- new.env(hash=TRUE, size=n)
  from <- "1234567890"
  to <- "ABCDEFGHIJ"
  for (i in 1:n) {
    assign(x=chartr(from, to, i), value=i, envir=e)
  }
  e
}

Here’s an example of the output of this function:

> e.20 <- labeled.environment(20)

> e.20

<environment: 0x143756c>

Measuring Lookup Speed

You can fetch values from an environment object with the get function:

> get("A", envir=e.20)
[1] 1
> get("BJ", envir=e.20)
[1] 20

You can also fetch values from an environment with the double bracket operator:

> e.20[["A"]]
[1] 1
> e.20[["BJ"]]
[1] 20
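
One detail worth knowing when an environment serves as a lookup table: get() and exists() search enclosing environments by default, while the double bracket operator looks only in the environment itself. The sketch below illustrates the difference; the environment `tbl` and its contents are invented for the example.

```r
# A small lookup table with a single entry (illustrative data).
tbl <- new.env(hash = TRUE)
assign("Joe", 1, envir = tbl)

# "Joe" really is stored in the table.
in_table <- exists("Joe", envir = tbl, inherits = FALSE)

# "c" is not stored in the table, but the default search continues
# through the enclosing environments and finds the base function c().
leaks <- exists("c", envir = tbl)

# Restricting the search to the table itself avoids the false hit.
strict <- exists("c", envir = tbl, inherits = FALSE)

# The double bracket operator never inherits; a missing key gives NULL.
by_bracket <- tbl[["c"]]
```

Passing inherits=FALSE to get() and exists() is a reasonable default whenever keys might collide with names defined elsewhere.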

Measuring Lookup Speed

• Creating examples for testing

arrays <- list()
for (i in 10:15) {
  arrays[[as.character(2^i)]] <- labeled.array(2^i)
}

environments <- list()
for (i in 10:15) {
  environments[[as.character(2^i)]] <- labeled.environment(2^i)
}

Measuring Lookup Speed

• Using the test function:

test_expressions("first element, by index:",
  function(d, l, r) {
    s <- 0
    for (v in 1:r) {
      s <- s + d[1]
    }
  },
  arrays, 1024)

• Output:

first element, by index:
 1024  2048  4096  8192 16384 32768
0.010 0.003 0.004 0.003 0.005 0.004

Measuring Lookup Speed

• Results for 1024 lookups (user time in seconds), by number of elements:

Lookup method                                    1024   2048   4096   8192  16384  32768
Array index, first                              0.010  0.003  0.004  0.003  0.005  0.004
Array index, last                               0.010  0.004  0.004  0.004  0.003  0.004
Array label, single bracket, first              0.268  0.282  0.588  1.439  2.728  5.397
Array label, single bracket, last               0.173  0.278  0.582  1.517  2.713  5.266
Array label, double bracket, exact, first       0.002  0.002  0.002  0.002  0.003  0.002
Array label, double bracket, exact, last        0.036  0.070  0.136  0.273  0.549  1.107
Array label, double bracket, not exact, first   0.010  0.003  0.003  0.002  0.003  0.003
Array label, double bracket, not exact, last    0.042  0.069  0.137  0.275  0.551  1.112
Environment label, first                        0.012  0.005  0.006  0.006  0.005  0.005
Environment label, last                         0.012  0.005  0.006  0.005  0.006  0.005

Measuring Lookup Speed

Notice that in the results above, the times for single-bracket lookups by label, and for double-bracket lookups of the last element by label, increase roughly linearly with the number of elements in the array.

Measuring Lookup Speed

Let’s focus on the results for the largest arrays (which are the

most precise)

Measuring Lookup Speed

• Results for 1024 lookups, 32768 elements (user time in seconds):

Lookup method                                    Time
Array index, first                              0.004
Array index, last                               0.004
Array label, single bracket, first              5.397
Array label, single bracket, last               5.266
Array label, double bracket, exact, first       0.002
Array label, double bracket, exact, last        1.107
Array label, double bracket, not exact, first   0.003
Array label, double bracket, not exact, last    1.112
Environment label, first                        0.005
Environment label, last                         0.005
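
To try a comparison like this on your own machine, a small self-contained sketch follows. It is not the test harness used for the table; the variable names are assumptions, the table is smaller (4096 elements), and timings will differ across machines and R versions.

```r
# Build a labeled vector and an environment holding the same data.
n <- 2^12
keys <- chartr("1234567890", "ABCDEFGHIJ", as.character(1:n))
v <- setNames(1:n, keys)
e <- list2env(as.list(v), envir = new.env(hash = TRUE, size = n))

# Look up the last label repeatedly in each structure.
last <- keys[n]
reps <- 1024
t_vec <- system.time(for (r in 1:reps) v[[last]])[["user.self"]]
t_env <- system.time(for (r in 1:reps) e[[last]])[["user.self"]]

# Report user time in seconds for each approach.
print(c(vector = t_vec, environment = t_env))
```

Timing the last label is the worst case for a linear scan, which is why it is the more revealing of the two positions measured in the table.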

Optimizing Lookup Speed

How to write efficient code:

1. Write code for clarity, not speed

2. Check to see if the code is fast enough. If it is

fast enough, stop.

3. Test your code to find where time is being spent

4. Fix the parts of your code that are taking too much time.

5. Go to step 2

Optimizing Lookup Speed

• How do you make lookups fast?

– Lookups by position are fastest

– If you have to look up single values by name, write your code with double brackets

• Double-bracket lookups are a little faster than single bracket

lookups

• If you discover that your code is too slow, you can easily

change from vectors to environments
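
The reason the switch is easy is that the double bracket operator works on both structures. A minimal sketch, with a hypothetical helper `lookup` and made-up data:

```r
# Hypothetical lookup table stored as a named vector.
scores_vec <- c(Joe = 1, Bob = 2, Jim = 3)

# Code written with double brackets...
lookup <- function(table, name) table[[name]]

# ...keeps working unchanged when the table becomes an environment.
scores_env <- list2env(as.list(scores_vec), envir = new.env(hash = TRUE))

from_vec <- lookup(scores_vec, "Bob")
from_env <- lookup(scores_env, "Bob")
print(c(from_vec, from_env))
```

Only the line that builds the table changes; every call site that uses double brackets stays the same.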

Optimizing Lookup Speed

• What if

– Your code is too slow

– You need to look up values by name

– It would be hard to change your code to use double-

bracket notation

• Define a bracket operator for environments!

Optimizing Lookup Speed

Remember that every operation in R is a function call, even lookup operators.

Example code:

> b <- c(Joe=1, Bob=2, Jim=3)
> b["Bob"]
Bob
  2

Optimizing Lookup Speed

Translation of the example code:

> b["Bob"]
Bob
  2
> as.list(quote(b["Bob"]))
[[1]]
`[`

[[2]]
b

[[3]]
[1] "Bob"

Optimizing Lookup Speed

R translates

b["B"]

to

`[`(b, "B")
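
A quick way to convince yourself of this translation is to call the operator in its function form and compare the results. The sketch below also passes `[` to sapply() as an ordinary function; the data are made up.

```r
b <- c(Joe = 1, Bob = 2, Jim = 3)

# Operator form and function-call form of the same lookup.
infix  <- b["Bob"]
prefix <- `[`(b, "Bob")
same <- identical(infix, prefix)
print(same)

# Because `[` is a function, it can be handed to other functions:
# here, take the first element of each vector in a list.
firsts <- sapply(list(1:3, 4:6), `[`, 1)
print(firsts)
```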

Optimizing Lookup Speed

Here is the code for our new subset function

`[` <- function(x, ...) {
  if (is.environment(x)) {
    # environments: treat the first index as a name and fetch its value
    get(..1, envir=x)
  } else {
    # everything else: pass all arguments (including drop) to the built-in
    .Primitive("[")(x, ...)
  }
}

Optimizing Lookup Speed

Assignments through bracket notation are a little

funny. For example, R evaluates

x[3:5] <- 13:15

as if this code had been executed:

`*tmp*` <- x
x <- "[<-"(`*tmp*`, 3:5, value=13:15)
rm(`*tmp*`)
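
The equivalence can be checked directly by performing the same replacement both ways. This is a minimal sketch; the variable `y` is introduced only for the comparison.

```r
x <- 1:10
y <- x

# Usual replacement syntax.
x[3:5] <- 13:15

# The same replacement written as an explicit call to the `[<-` function,
# whose result is assigned back to the variable.
y <- `[<-`(y, 3:5, value = 13:15)

print(identical(x, y))
```

The function returns the modified object rather than changing it in place, which is why the result must be assigned back.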

Optimizing Lookup Speed

Here is the code for our new subset assignment function

`[<-` <- function(x, ..., value) {
  if (is.environment(x)) {
    # environments: treat the first index as a name and bind the value to it
    assign(x=..1, value=value, envir=x)
    # the assign statement returns value,
    # but we want to return the environment:
    x
  } else {
    # everything else: pass all arguments to the built-in
    .Primitive("[<-")(x, ..., value=value)
  }
}
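
Masking `[` and `[<-` affects every bracket expression evaluated where the new definitions are visible, so it can be worth limiting their scope. The sketch below wraps operators like the ones above in local() and exercises them on an environment and on an ordinary vector. Using local(), the `...`-based signatures, and the sample data are assumptions made for this example.

```r
result <- local({
  # Bracket lookup: environments are handled by name, everything else
  # is forwarded unchanged to the built-in operator.
  `[` <- function(x, ...) {
    if (is.environment(x)) {
      get(..1, envir = x, inherits = FALSE)
    } else {
      .Primitive("[")(x, ...)
    }
  }

  # Bracket assignment: bind the value in the environment and return the
  # environment, or forward to the built-in for other objects.
  `[<-` <- function(x, ..., value) {
    if (is.environment(x)) {
      assign(..1, value, envir = x)
      x
    } else {
      .Primitive("[<-")(x, ..., value = value)
    }
  }

  # Single-bracket syntax now works on an environment...
  e <- new.env(hash = TRUE)
  e["Joe"] <- 1
  e["Bob"] <- 2

  # ...and still works on a plain named vector.
  v <- c(a = 10, b = 20)
  v["b"] <- 25

  list(env = e["Bob"], vec = v["b"])
})
print(result)
```

Outside the local() block, the built-in operators are untouched.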

How to reach me

twitter: @jadler

http://www.linkedin.com/in/josephadler

baseballhacks@gmail.com

Backup Slides

• A function to test the performance of a lookup function on an object:

test_expressions <- function(description, fun, data, reps) {
  cat(paste(description, "\n"))
  results <- vector()
  for (n in names(data)) {
    results[[n]] <- system.time(
      fun(data[[n]], as.integer(n), reps)
    )[["user.self"]]
  }
  print(results)
}

To figure out the full argument list for the bracket operator, use the getGeneric function:

> getGeneric("[")
standardGeneric for "[" defined from package "base"

function (x, i, j, ..., drop = TRUE)
standardGeneric("[", .Primitive("["))
<environment: 0x11a6828>
Methods may be defined for arguments: x, i, j, drop
Use showMethods("[") for currently available ones.

In general, you should set new methods with the setMethod function. Example:

setClass("myenv", representation(e="environment"))
setMethod("[",
  signature(x="myenv", i="character", j="missing"),
  function(x, i, j, ..., drop=TRUE) {
    get(x=i, envir=x@e)
  }
)
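
Using a class like this might look as follows. The sketch repeats the definitions so that it runs on its own; the object name `tbl` and the stored value are invented for illustration.

```r
library(methods)

# S4 class that wraps an environment in a slot named e.
setClass("myenv", representation(e = "environment"))

# Single-bracket lookup by name for objects of that class.
setMethod("[",
  signature(x = "myenv", i = "character", j = "missing"),
  function(x, i, j, ..., drop = TRUE) {
    get(x = i, envir = x@e)
  }
)

# Create an instance, store a value in the wrapped environment,
# and fetch it with single-bracket syntax.
tbl <- new("myenv", e = new.env(hash = TRUE))
assign("Joe", 1, envir = tbl@e)
value <- tbl["Joe"]
print(value)
```

This approach leaves the built-in `[` untouched, at the cost of wrapping each environment in an object of the new class.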

Unfortunately, R doesn’t let you redefine these operators for environments, so we have to do something trickier.