Collaborative Filtering in Map/Reduce

Post on 08-May-2015

Collaborative Filtering in Map/Reduce

Ole-Martin Mørk - Open AdExchange

Tuesday, 14 September 2010

Vision

• Learn that Map/Reduce is simple

• Learn that Map/Reduce may be powerful

• Collaborative Filtering is fun!

Agenda

• Map/Reduce

• Collaborative Filtering

• Collaborative Filtering with Map/Reduce

• Amazon Elastic MapReduce

Map/Reduce

Map/Reduce

• Very scalable algorithm

• Inspired by map and reduce from functional programming.

• Everything is based on key/value

6 phases

• Reader

• Map

• Partition

• Comparison

• Reduce

• Writer
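
The six phases can be sketched as one small in-memory function. This is only an illustration under simplifying assumptions, not how Hadoop is implemented: the name `SixPhases.run` and its parameters are made up for this example, the "reader" is an already parsed sequence of key/value pairs, and the "writer" simply returns the result.

```scala
object SixPhases {
  // Hypothetical in-memory job runner; all names are illustrative.
  def run[K1, V1, K2: Ordering, V2, V3](input: Seq[(K1, V1)], reducers: Int)(
      mapFn: (K1, V1) => Seq[(K2, V2)])(
      reduceFn: (K2, Seq[V2]) => (K2, V3)): Seq[(K2, V3)] = {
    // 1. Reader: `input` already holds the records as key/value pairs.
    // 2. Map: every record yields zero or more new key/value pairs.
    val mapped: Seq[(K2, V2)] = input.flatMap { case (k, v) => mapFn(k, v) }
    // 3. Partition: decide which reducer is responsible for each key.
    val partitions: Map[Int, Seq[(K2, V2)]] =
      mapped.groupBy { case (k, _) => (k.hashCode() & Int.MaxValue) % reducers }
    // 4. Comparison: within a partition, group the values by key and sort the keys.
    // 5. Reduce: each key and its list of values becomes one output pair.
    partitions.toSeq.sortBy(_._1).flatMap { case (_, pairs) =>
      pairs.groupBy(_._1).toSeq.sortBy(_._1).map { case (key, group) =>
        reduceFn(key, group.map(_._2))
      }
    }
    // 6. Writer: the returned sequence stands in for the output files.
  }
}
```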

Map

functional map

List("hello", "dude").map { x => x.substring(0, 1) }

Map/Reduce map

• Input is key/value

• Output is key/value

Simple Example, Map

• Count occurrences of words in a document

• Input is: <linenumber>, <content of line>

• For each word on the line, the output is <word>, <count>
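
A minimal sketch of such a map function, written as plain Scala without the Hadoop API. The object name `WordCountMap`, the lower-casing, and the split on non-word characters are assumptions made for this example; the emitted count is 1 per occurrence.

```scala
object WordCountMap {
  // Map: (line number, content of line) => one (word, 1) pair per word occurrence.
  // The line number is part of the input key but is not needed for counting.
  def map(lineNumber: Long, line: String): Seq[(String, Int)] =
    line.toLowerCase
      .split("\\W+")        // assumption: words are separated by non-word characters
      .toSeq
      .filter(_.nonEmpty)   // a leading separator produces an empty first token
      .map(word => (word, 1))
}
```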

Map

Reduce

functional reduce

val sum = List(32, 40, 23).reduceLeft { _ + _ }

Map/Reduce reduce

• Input is key/list of values

• Output is key/value

Simple Example, Reduce

• Reduce input is <word>, <list of counts>

• For each value in the list, we add it to the running count

• Output is <word>, <sum of counts>
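
The matching reduce step could look like the sketch below. It is again plain Scala rather than the Hadoop API, and the name `WordCountReduce` is invented for the example.

```scala
object WordCountReduce {
  // Reduce: (word, list of counts) => (word, sum of counts).
  def reduce(word: String, counts: Seq[Int]): (String, Int) = {
    var total = 0
    for (count <- counts) total += count // for each value, increase the count
    (word, total)
  }
}
```

The loop mirrors the wording of the slide; `counts.sum` would give the same result.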

Reduce

Collaborative Filtering

Amazon

Last.fm

Sceneami.com

User based

• Useful when we have

• Small number of users

• High correlation between users

• Data that changes often

Item based

• Useful for big sites such as Amazon

• Small overlap between users

• Mostly static data

Euclidean Distance

[Figure: plot with the presentations "Min drømmeapplikasjon" ("My dream application") and "Pattern Matching in Scala" as axes; labels: Rating, Match]

Euclidean Distance

• Alf’s presentations: 1, 25, 56, 57, 58, 98 (6)

• Kari’s presentations: 2, 25, 98, 99 (4)

• Equal presentations: 25 and 98 (2)

• Unmatched presentations: (6 - 2) + (4 - 2) = 6

• Distance score: 1 / (1 + sqrt(6)) = 0.29
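
The calculation above can be written as a small function. This is a sketch that follows the arithmetic on the slide; the names `Similarity` and `score` are made up for the example, and presentations are represented as plain integer ids.

```scala
object Similarity {
  // Score between two users, each described by the set of presentations they chose.
  def score(a: Set[Int], b: Set[Int]): Double = {
    val common = (a intersect b).size
    // Presentations chosen by only one of the two users.
    val unmatched = (a.size - common) + (b.size - common)
    1.0 / (1.0 + math.sqrt(unmatched.toDouble))
  }
}
```

With Alf's and Kari's sets from the slide, `unmatched` is 6 and the score rounds to 0.29. Two identical sets give the maximum score of 1.0.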

Recommended sessions

• Me: 1, 2, 5, 6, 7

• Kate (0.31): 5, 6, 8, 9

• Paul (0.41): 1, 2, 4, 5, 6

• Mary (0.31): 1, 5, 8, 9

• Recommended: 8 (0.62), 9 (0.62), 4 (0.41)
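
The numbers on this slide are consistent with a simple rule: every session I have not picked gets the sum of the similarity scores of the users who picked it. The sketch below assumes that rule; the names `Recommender` and `recommend` are invented for the example.

```scala
object Recommender {
  // `others` holds, per similar user, the similarity score and the sessions they picked.
  def recommend(mine: Set[Int], others: Seq[(Double, Set[Int])]): Seq[(Int, Double)] = {
    // One (session, similarity) vote per similar user and unseen session.
    val votes: Seq[(Int, Double)] = others.flatMap { case (similarity, sessions) =>
      sessions.toSeq.filterNot(mine.contains).map(session => (session, similarity))
    }
    votes
      .groupBy(_._1)
      .map { case (session, scored) => (session, scored.map(_._2).sum) }
      .toSeq
      // Highest score first; ties are ordered by session number.
      .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
  }
}
```

Called with the data from the slide, this ranks sessions 8 and 9 first (Kate plus Mary, 0.31 + 0.31) and session 4 after them (Paul only, 0.41).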

Demo

More Map/Reduce

Several iterations

Iteration 1

Iteration 2

Iteration 3

Partitioning

[Diagram: the keys Paul, Mary, Kate, Lea, Jeff and Ali are distributed across two reducers (Jeff, Kate, Mary / Ali, Lea, Paul)]
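
How a key ends up at a particular reducer can be sketched with the formula used by Hadoop's default `HashPartitioner`: the non-negative hash code modulo the number of reducers. The example applies that formula to plain Scala strings; the object name `Partitioning` is made up, and which names land on which reducer depends on their hash codes, so it need not match the diagram.

```scala
object Partitioning {
  // Masking with Int.MaxValue clears the sign bit, so the result is never negative.
  def partition(key: String, numReducers: Int): Int =
    (key.hashCode() & Int.MaxValue) % numReducers

  // Groups keys by the reducer that would receive them.
  def assign(keys: Seq[String], numReducers: Int): Map[Int, Seq[String]] =
    keys.groupBy(key => partition(key, numReducers))
}
```

The important property is that the same key always goes to the same reducer, so all values for one key are reduced together.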

Comparison

[Diagram: the values Paul, Lea, Ali, Jeff, Mary and Kate are grouped by key across two reducers (Pres 2: Kate, Jeff, Mary / Pres 1: Paul, Ali, Lea)]

Guidelines

• Never access external sources during computation.

• Your functions should be small and fast

• You might not have all the data available

Hadoop

• Hadoop reuses objects, so remember to clone them if you plan to keep them.

• You can read and write all objects implementing org.apache.hadoop.io.WritableComparable

• write(DataOutput)

• readFields(DataInput)

• compareTo(Object)
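
The three methods can be illustrated without Hadoop on the classpath. In the sketch below, `WritableComparableLike` is a stand-in trait declared only for this example (real code would implement `org.apache.hadoop.io.WritableComparable` instead), and `UserPair` is a hypothetical key type holding two user names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInput, DataInputStream, DataOutput, DataOutputStream}

// Stand-in for Hadoop's WritableComparable, so the example is self-contained.
trait WritableComparableLike[T] extends Comparable[T] {
  def write(out: DataOutput): Unit
  def readFields(in: DataInput): Unit
}

// Hypothetical key type: a pair of user names.
class UserPair(var first: String, var second: String) extends WritableComparableLike[UserPair] {
  // No-argument constructor, so an empty instance can be created and then filled by readFields.
  def this() = this("", "")

  def write(out: DataOutput): Unit = { out.writeUTF(first); out.writeUTF(second) }

  def readFields(in: DataInput): Unit = { first = in.readUTF(); second = in.readUTF() }

  // Orders by first name, then by second name.
  def compareTo(other: UserPair): Int = {
    val byFirst = first.compareTo(other.first)
    if (byFirst != 0) byFirst else second.compareTo(other.second)
  }

  // Take a copy before storing an instance, since the framework may reuse and overwrite it.
  def copy(): UserPair = new UserPair(first, second)
}

object UserPairDemo {
  // Serializes a pair to bytes and reads it back into a fresh instance.
  def roundTrip(pair: UserPair): UserPair = {
    val bytes = new ByteArrayOutputStream()
    pair.write(new DataOutputStream(bytes))
    val restored = new UserPair()
    restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray)))
    restored
  }
}
```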

Collaborative Filtering, the Map/Reduce way

Overview

• Create an application that recommends JavaZone presentations.

• Overall goal: Scalable performance

• 4 iterations

• Reading input from text file

Iteration 1

• Map input: <user>, <presentations>

• Map output: <presentation>, <user>

• Reduce output: <presentation>, <userList>
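
An in-memory sketch of this first iteration, which inverts the user-to-presentations relation. It is not the code from the talk: `Iteration1Sketch` and its method names are invented, presentations are integer ids, and the grouping done by the framework is simulated with `groupBy`.

```scala
object Iteration1Sketch {
  // Map: <user>, <presentations> => one <presentation>, <user> pair per presentation.
  def map(user: String, presentations: Seq[Int]): Seq[(Int, String)] =
    presentations.map(presentation => (presentation, user))

  // Reduce: collect all users of a presentation into one list (sorted for stable output).
  def reduce(presentation: Int, users: Seq[String]): (Int, Seq[String]) =
    (presentation, users.sorted)

  // Simulates the job: map every record, group by key, reduce every group.
  def run(input: Map[String, Seq[Int]]): Map[Int, Seq[String]] =
    input.toSeq
      .flatMap { case (user, presentations) => map(user, presentations) }
      .groupBy(_._1)
      .map { case (presentation, pairs) => reduce(presentation, pairs.map(_._2)) }
}
```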

Iteration 2

• Map input: <presentation>, <userList>

• Map output: <user>, <userList>

• Reduce input: <user>, <list of userList>

• Reduce output: <userTuplet>, <match count>
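
The second iteration could be sketched as follows. The object and method names are invented, and one detail is an assumption: each pair of users is emitted once, with the names in alphabetical order, since the next iteration is the one that produces the reversed pair.

```scala
object Iteration2Sketch {
  // Map: <presentation>, <userList> => <user>, <userList> for every user in the list.
  def map(presentation: Int, users: Seq[String]): Seq[(String, Seq[String])] =
    users.map(user => (user, users))

  // Reduce: <user>, <list of userList> => <(user, other)>, <match count>.
  // The match count is the number of lists (presentations) both users appear in.
  def reduce(user: String, userLists: Seq[Seq[String]]): Seq[((String, String), Int)] =
    userLists.flatten
      .filter(other => user.compareTo(other) < 0) // assumption: emit each pair only once
      .groupBy(identity)
      .map { case (other, hits) => ((user, other), hits.size) }
      .toSeq

  // Simulates the job in memory.
  def run(input: Map[Int, Seq[String]]): Map[(String, String), Int] =
    input.toSeq
      .flatMap { case (presentation, users) => map(presentation, users) }
      .groupBy(_._1)
      .toSeq
      .flatMap { case (user, pairs) => reduce(user, pairs.map(_._2)) }
      .toMap
}
```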

Iteration 3

• Map input: <userTuplet>, <match count>

• Map output: <userTuplet>, <diff>

• Map output: <userTuplet reversed>, <diff>

• Reduce output: <user>, <similaruser>
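
One possible reading of this iteration is sketched below. The slides do not say how the diff is computed, so the example makes two assumptions: the diff is the number of unmatched presentations, and the mapper has access to `totals`, the number of presentations per user. All names are invented for the example.

```scala
object Iteration3Sketch {
  type Pair = (String, String)

  // Map: <pair>, <match count> => <pair>, <diff> and <reversed pair>, <diff>.
  // ASSUMPTION: `totals` (presentations per user) is available to the mapper.
  def map(pair: Pair, matches: Int, totals: Map[String, Int]): Seq[(Pair, Int)] = {
    val (a, b) = pair
    val diff = (totals(a) - matches) + (totals(b) - matches)
    Seq(((a, b), diff), ((b, a), diff))
  }

  // Reduce: turn each diff into a 1 / (1 + sqrt(diff)) score and list the similar users, best first.
  def reduce(user: String, others: Seq[(String, Int)]): (String, Seq[(String, Double)]) = {
    val scored = others.map { case (other, diff) =>
      (other, 1.0 / (1.0 + math.sqrt(diff.toDouble)))
    }
    (user, scored.sortWith((x, y) => x._2 > y._2))
  }

  // Simulates the job in memory.
  def run(input: Map[Pair, Int], totals: Map[String, Int]): Map[String, Seq[(String, Double)]] =
    input.toSeq
      .flatMap { case (pair, matches) => map(pair, matches, totals) }
      .groupBy { case ((user, _), _) => user }
      .map { case (user, rows) =>
        reduce(user, rows.map { case ((_, other), diff) => (other, diff) })
      }
}
```

Emitting the reversed pair in the map step is what lets each user's reducer see that user's full list of similar users.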

Iteration 4

• Map input: <user>, <similaruser>

• Map output: <user>, <presentation with score>

• Reduce output: <user>, <presentations>

Demo

Map/Reduce on EC2

Elastic Map/Reduce

• Same code

• Same input

• Different configuration

Upload files

s3cmd.rb put oax-jz10:jar/oax-jz10.jar target/oax.jz10.jar

s3cmd.rb put oax-jz10:input/data.txt data.txt

Create job flow

elastic-mapreduce --create --alive --log-uri s3n://oax-jz10/log

Register iterations

elastic-mapreduce --jobflow j-1NLAIW45QUN4B --jar s3n://oax-jz10/jar/oax-jz10.jar --arg com.openadex.pres.iterations.Iteration1 --arg s3n://oax-jz10/input --arg s3n://oax-jz10/output1

Download output

s3cmd.rb get oax-jz10:output4/part-00000 out

Demo

Summary

• Map/Reduce may be simple

• Map/Reduce can be really powerful

• Collaborative filtering is fun :-)

Thank you

Ole-Martin Mørk

olemartin@gmail.com

twitter.com/olemartin

del.icio.us/olemartin/jz10

All images are licensed under Creative Commons. See http://bit.ly/mr-photos for details.
