Large-scale social media analysis with Hadoop

Large-scale social media analysis with Hadoop Jake Hofman Yahoo! Research May 23, 2010 @jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 1 / 71

description

In this tutorial we will discuss the use of Hadoop for processing large-scale social data sets. We will first cover the map/reduce paradigm in general and subsequently discuss the particulars of Hadoop's implementation. We will then present several use cases for Hadoop in analyzing example data sets, examining the design and implementation of various algorithms with an emphasis on social network analysis.

Transcript of Large-scale social media analysis with Hadoop

Page 1: Large-scale social media analysis with Hadoop

Large-scale social media analysis with Hadoop

Jake Hofman

Yahoo! Research

May 23, 2010

@jakehofman (Yahoo! Research) Large-scale social media analysis w/ Hadoop May 23, 2010 1 / 71

Page 2: Large-scale social media analysis with Hadoop


Page 3: Large-scale social media analysis with Hadoop

1970s ∼ 10^1 nodes

JOURNAL OF ANTHROPOLOGICAL RESEARCH

FIGURE 1

Social Network Model of Relationships in the Karate Club

[Network diagram of the 34 club members; node labels omitted]

This is the graphic representation of the social relationships among the 34 individuals in the karate club. A line is drawn between two points when the two individuals being represented consistently interacted in contexts outside those of karate classes, workouts, and club meetings. Each such line drawn is referred to as an edge.

two individuals consistently were observed to interact outside the normal activities of the club (karate classes and club meetings). That is, an edge is drawn if the individuals could be said to be friends outside the club activities. This graph is represented as a matrix in Figure 2. All the edges in Figure 1 are nondirectional (they represent interaction in both directions), and the graph is said to be symmetrical. It is also possible to draw edges that are directed (representing one-way relationships); such

CONFLICT AND FISSION IN SMALL GROUPS

to bounded social groups of all types in all settings. Also, the data required can be collected by a reliable method currently familiar to anthropologists, the use of nominal scales.

THE ETHNOGRAPHIC RATIONALE

The karate club was observed for a period of three years, from 1970 to 1972. In addition to direct observation, the history of the club prior to the period of the study was reconstructed through informants and club records in the university archives. During the period of observation, the club maintained between 50 and 100 members, and its activities included social affairs (parties, dances, banquets, etc.) as well as regularly scheduled karate lessons. The political organization of the club was informal, and while there was a constitution and four officers, most decisions were made by consensus at club meetings. For its classes, the club employed a part-time karate instructor, who will be referred to as Mr. Hi.2

At the beginning of the study there was an incipient conflict between the club president, John A., and Mr. Hi over the price of karate lessons. Mr. Hi, who wished to raise prices, claimed the authority to set his own lesson fees, since he was the instructor. John A., who wished to stabilize prices, claimed the authority to set the lesson fees since he was the club's chief administrator.

As time passed the entire club became divided over this issue, and the conflict became translated into ideological terms by most club members. The supporters of Mr. Hi saw him as a fatherly figure who was their spiritual and physical mentor, and who was only trying to meet his own physical needs after seeing to theirs. The supporters of John A. and the other officers saw Mr. Hi as a paid employee who was trying to coerce his way into a higher salary. After a series of increasingly sharp factional confrontations over the price of lessons, the officers, led by John A., fired Mr. Hi for attempting to raise lesson prices unilaterally. The supporters of Mr. Hi retaliated by resigning and forming a new organization headed by Mr. Hi, thus completing the fission of the club.

During the factional confrontations which preceded the fission, the club meeting remained the setting for decision making. If, at a given meeting, one faction held a majority, it would attempt to pass resolutions and decisions favorable to its ideological position. The other faction would then retaliate at a future meeting when it held the majority, by repealing the unfavorable decisions and substituting ones

2 All names given are pseudonyms in order to protect the informants' anonymity. For similar reasons, the exact location of the study is not given.

• Few direct observations; highly detailed info on nodes and edges

• E.g. karate club (Zachary, 1977)

Page 4: Large-scale social media analysis with Hadoop

1990s ∼ 10^4 nodes

• Larger, indirect samples; relatively few details on nodes and edges

• E.g. APS co-authorship network (http://bit.ly/aps08jmh)

Page 5: Large-scale social media analysis with Hadoop

Present ∼ 10^7 nodes +

• Very large, dynamic samples; many details in node and edge metadata

• E.g. Mail, Messenger, Facebook, Twitter, etc.

Page 6: Large-scale social media analysis with Hadoop

What could you ask of it?


Page 8: Large-scale social media analysis with Hadoop

Look familiar?

# ls -talh neat_dataset.tar.gz
-rw-r--r-- 100T May 23 13:00 neat_dataset.tar.gz

Page 9: Large-scale social media analysis with Hadoop

Look familiar? [1]

# ls -talh twitter_rv.tar
-rw-r--r-- 24G May 23 13:00 twitter_rv.tar

[1] http://an.kaist.ac.kr/traces/WWW2010.html

Page 10: Large-scale social media analysis with Hadoop

Agenda

Large-scale social media analysis with Hadoop


Page 11: Large-scale social media analysis with Hadoop

Agenda

Large-scale social media analysis with Hadoop

GB/TB/PB-scale, 10,000+ nodes


Page 12: Large-scale social media analysis with Hadoop

Agenda

Large-scale social media analysis with Hadoop

network & text data


Page 13: Large-scale social media analysis with Hadoop

Agenda

Large-scale social media analysis with Hadoop

network analysis & machine learning


Page 14: Large-scale social media analysis with Hadoop

Agenda

Large-scale social media analysis with Hadoop

open source Apache project for distributed storage/computation


Page 15: Large-scale social media analysis with Hadoop

Warning

You may be bored if you already know how to ...

• Install and use Hadoop (on a single machine and EC2)

• Run jobs in local and distributed modes

• Implement distributed solutions for:
  • Parsing and manipulating large text collections
  • Clustering coefficient, BFS, etc., for networks w/ billions of edges
  • Classification, clustering


Page 22: Large-scale social media analysis with Hadoop

Selected resources

http://www.hadoopbook.com/


Page 23: Large-scale social media analysis with Hadoop

Selected resources

http://www.cloudera.com/


Page 24: Large-scale social media analysis with Hadoop

Selected resources

Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland, College Park

Draft of February 19, 2010

This is a (partial) draft of a book that is in preparation for Morgan & Claypool Synthesis Lectures on Human Language Technologies. Anticipated publication date is mid-2010. Comments and feedback are welcome!

http://www.umiacs.umd.edu/~jimmylin/book.html

Page 25: Large-scale social media analysis with Hadoop

Selected resources

... and many more at

http://delicious.com/jhofman/hadoop


Page 26: Large-scale social media analysis with Hadoop

Selected resources

... and many more at

http://delicious.com/pskomoroch/hadoop


Page 27: Large-scale social media analysis with Hadoop

Outline

1 Background (5 Ws)

2 Introduction to MapReduce (How, Part I)

3 Applications (How, Part II)


Page 28: Large-scale social media analysis with Hadoop

What?


Page 29: Large-scale social media analysis with Hadoop

What?

“... to create building blocks for programmers who just happen to have lots of data to store, lots of data to analyze, or lots of machines to coordinate, and who don’t have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.”

- Tom White, Hadoop: The Definitive Guide

Page 30: Large-scale social media analysis with Hadoop

What?

Hadoop contains many subprojects:

We’ll focus on distributed computation with MapReduce.


Page 31: Large-scale social media analysis with Hadoop

Who/when?

An overly brief history


Page 32: Large-scale social media analysis with Hadoop

Who/when?

pre-2004: Cutting and Cafarella develop open source projects for web-scale indexing, crawling, and search

Page 33: Large-scale social media analysis with Hadoop

Who/when?

2004: Dean and Ghemawat publish MapReduce programming model, used internally at Google

Page 34: Large-scale social media analysis with Hadoop

Who/when?

2006: Hadoop becomes official Apache project, Cutting joins Yahoo!, Yahoo! adopts Hadoop

Page 35: Large-scale social media analysis with Hadoop

Who/when?


Page 36: Large-scale social media analysis with Hadoop

Where?

http://wiki.apache.org/hadoop/PoweredBy


Page 37: Large-scale social media analysis with Hadoop

Why?

Why yet another solution?

(I already use too many languages/environments)


Page 38: Large-scale social media analysis with Hadoop

Why?

Why a distributed solution?

(My desktop has TBs of storage and GBs of memory)


Page 39: Large-scale social media analysis with Hadoop

Why?

Roughly how long to read 1TB from a commodity hard disk?


Page 40: Large-scale social media analysis with Hadoop

Why?

Roughly how long to read 1TB from a commodity hard disk?

\[ \frac{1}{2}\,\frac{\text{Gb}}{\text{sec}} \times \frac{1}{8}\,\frac{\text{B}}{\text{b}} \times 3600\,\frac{\text{sec}}{\text{hr}} \approx 225\,\frac{\text{GB}}{\text{hr}} \]

Page 41: Large-scale social media analysis with Hadoop

Why?

Roughly how long to read 1TB from a commodity hard disk?

≈ 4hrs
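The estimate can be checked with a few lines of arithmetic. This is a rough sketch that takes the slide's 1/2 Gb/sec read rate at face value and assumes 1 TB = 1000 GB; the constant names are illustrative only.

```python
# Assumed sustained read rate from the slide: 1/2 gigabit per second.
GIGABITS_PER_SEC = 0.5
BITS_PER_BYTE = 8
SECONDS_PER_HOUR = 3600

gigabytes_per_hour = GIGABITS_PER_SEC / BITS_PER_BYTE * SECONDS_PER_HOUR

# Treat 1 TB as 1000 GB for a rough estimate.
hours_per_terabyte = 1000 / gigabytes_per_hour

print(f"{gigabytes_per_hour:.0f} GB/hr, {hours_per_terabyte:.1f} hours per TB")
```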


Page 42: Large-scale social media analysis with Hadoop

Why?

http://bit.ly/petabytesort


Page 43: Large-scale social media analysis with Hadoop

Outline

1 Background (5 Ws)

2 Introduction to MapReduce (How, Part I)

3 Applications (How, Part II)


Page 44: Large-scale social media analysis with Hadoop

Typical scenario

Store, parse, and analyze high-volume server logs,

e.g. how many search queries match “icwsm”?


Page 45: Large-scale social media analysis with Hadoop

MapReduce: 30k ft

Break large problem into smaller parts, solve in parallel, combine results

Page 46: Large-scale social media analysis with Hadoop

Typical scenario

“Embarrassingly parallel” (or nearly so)

node 1: local read, filter
node 2: local read, filter
node 3: local read, filter
node 4: local read, filter

} collect results

Page 47: Large-scale social media analysis with Hadoop

Typical scenario++

How many search queries match “icwsm”, grouped by month?


Page 48: Large-scale social media analysis with Hadoop

MapReduce: example

Input records (three splits):

20091201,4.2.2.1,"icwsm 2010"
20100523,2.4.1.2,"hadoop"
20100101,9.7.6.5,"tutorial"
20091125,2.4.6.1,"data"
20090708,4.2.2.1,"open source"
20100124,1.2.2.4,"washington dc"

20100522,2.4.1.2,"conference"
20091008,4.2.2.1,"2009 icwsm"
20090807,4.2.2.1,"apache.org"
20100101,9.7.6.5,"mapreduce"
20100123,1.2.2.4,"washington dc"
20091121,2.4.6.1,"icwsm dates"

20090807,4.2.2.1,"distributed"
20091225,4.2.2.1,"icwsm"
20100522,2.4.1.2,"media"
20100123,1.2.2.4,"social"
20091114,2.4.6.1,"d.c."
20100101,9.7.6.5,"new year's"

Map matching records to (YYYYMM, count=1)

split 1: 200912, 1
split 2: 200910, 1
         200911, 1
split 3: 200912, 1

Shuffle to collect all records w/ same key (month)

200910, 1
...
200912, 1
200912, 1

...
200911, 1

Reduce results by adding count values for each key

200910, 1
...
200912, 2

...
200911, 1
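The three steps on this slide can be sketched in a single Python process. This is a minimal illustration rather than Hadoop code: the function names are hypothetical, only a handful of the slide's records are used, and an in-memory sort stands in for the distributed shuffle.

```python
from itertools import groupby

# A few of the slide's log records: date (YYYYMMDD), IP address, query.
LOG_LINES = [
    '20091201,4.2.2.1,"icwsm 2010"',
    '20100523,2.4.1.2,"hadoop"',
    '20091008,4.2.2.1,"2009 icwsm"',
    '20091121,2.4.6.1,"icwsm dates"',
    '20091225,4.2.2.1,"icwsm"',
    '20100522,2.4.1.2,"media"',
]

def map_matches(line, term="icwsm"):
    """Map: emit (YYYYMM, 1) if the query contains the term."""
    date, _ip, query = line.split(",", 2)
    if term in query:
        yield date[:6], 1

def count_by_month(lines):
    mapped = [pair for line in lines for pair in map_matches(line)]
    # Shuffle: sorting brings records with the same key together.
    mapped.sort(key=lambda pair: pair[0])
    # Reduce: add the count values for each key.
    return {month: sum(count for _, count in group)
            for month, group in groupby(mapped, key=lambda pair: pair[0])}

monthly_counts = count_by_month(LOG_LINES)
print(monthly_counts)
```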


Page 49: Large-scale social media analysis with Hadoop

MapReduce: paradigm

Programmer specifies map and reduce functions


Page 50: Large-scale social media analysis with Hadoop

MapReduce: paradigm

Map: transforms input record to intermediate (key, value) pair

Page 51: Large-scale social media analysis with Hadoop

MapReduce: paradigm

Shuffle: collects all intermediate records by key


Page 52: Large-scale social media analysis with Hadoop

MapReduce: paradigm

Reduce: transforms all records for given key to final output


Page 53: Large-scale social media analysis with Hadoop

MapReduce: paradigm

Distributed read, shuffle, and write are transparent to programmer
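To make the division of labor concrete, here is a hedged single-process sketch of the paradigm. The programmer writes only `mapper` and `reducer`; the hypothetical `run_mapreduce` driver plays the role of the framework, which in Hadoop would also handle the distributed read, shuffle, and write. The per-IP query count is an illustrative job, not one from the talk.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    # Map: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Shuffle: collect all intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce: turn all values for a key into the final output for that key.
    return {key: reducer(key, values) for key, values in sorted(grouped.items())}

# Hypothetical job: number of queries issued per IP address.
def ip_mapper(line):
    _date, ip, _query = line.split(",", 2)
    yield ip, 1

def sum_reducer(_key, values):
    return sum(values)

queries_per_ip = run_mapreduce(
    ['20091201,4.2.2.1,"icwsm 2010"',
     '20100523,2.4.1.2,"hadoop"',
     '20090708,4.2.2.1,"open source"'],
    ip_mapper,
    sum_reducer,
)
print(queries_per_ip)
```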


Page 54: Large-scale social media analysis with Hadoop

MapReduce: principles

• Move code to data (local computation)

• Allow programs to scale transparently w.r.t. size of input

• Abstract away fault tolerance, synchronization, etc.

Page 55: Large-scale social media analysis with Hadoop

MapReduce: strengths

• Batch, offline jobs

• Write-once, read-many across full data set

• Usually, though not always, simple computations

• I/O bound by disk/network bandwidth


Page 56: Large-scale social media analysis with Hadoop

!MapReduce

What it’s not:

• High-performance parallel computing, e.g. MPI

• Low-latency random access relational database

• Always the right solution


Page 57: Large-scale social media analysis with Hadoop

Word count

Input:

the quick brown fox
jumps over the lazy dog
who jumped over that
lazy dog -- the fox ?

Output (word, count):

dog 2
-- 1
the 3
brown 1
fox 2
jumped 1
lazy 2
jumps 1
over 2
quick 1
that 1
who 1
? 1

Page 58: Large-scale social media analysis with Hadoop

Word count

Map: for each line, output each word and count (of 1)

Input line → map output

the quick brown fox → the 1, quick 1, brown 1, fox 1
jumps over the lazy dog → jumps 1, over 1, the 1, lazy 1, dog 1
who jumped over that → who 1, jumped 1, over 1, that 1
lazy dog -- the fox ? → lazy 1, dog 1, -- 1, the 1, fox 1, ? 1

Page 59: Large-scale social media analysis with Hadoop

Word count

Shuffle: collect all records for each word

Input:

the quick brown fox
jumps over the lazy dog
who jumped over that
lazy dog -- the fox ?

Records collected by word:

-- 1
? 1
brown 1
dog 1, dog 1
fox 1, fox 1
jumped 1
jumps 1
lazy 1, lazy 1
over 1, over 1
quick 1
that 1
the 1, the 1, the 1
who 1

Page 60: Large-scale social media analysis with Hadoop

Word count

Reduce: add counts for each word

Collected records → reduced output

-- 1 → -- 1
? 1 → ? 1
brown 1 → brown 1
dog 1, dog 1 → dog 2
fox 1, fox 1 → fox 2
jumped 1 → jumped 1
jumps 1 → jumps 1
lazy 1, lazy 1 → lazy 2
over 1, over 1 → over 2
quick 1 → quick 1
that 1 → that 1
the 1, the 1, the 1 → the 3
who 1 → who 1
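The word count walk-through can be reproduced locally with one small function per phase. This is a sketch for illustration, not the `WordCount.java` shown in the talk: the function names are made up, and a sort plus `itertools.groupby` stands in for the distributed shuffle.

```python
from itertools import groupby
from operator import itemgetter

TEXT = """the quick brown fox
jumps over the lazy dog
who jumped over that
lazy dog -- the fox ?"""

def map_line(line):
    """Map: output each word in the line with a count of 1."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: collect all records for each word by sorting on the key."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_counts(word, records):
    """Reduce: add the counts for one word."""
    return word, sum(count for _, count in records)

mapped = [pair for line in TEXT.splitlines() for pair in map_line(line)]
word_counts = dict(reduce_counts(word, records) for word, records in shuffle(mapped))
print(word_counts)
```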


Page 62: Large-scale social media analysis with Hadoop

WordCount.java


Page 63: Large-scale social media analysis with Hadoop

Hadoop streaming


Page 64: Large-scale social media analysis with Hadoop

Hadoop streaming

MapReduce for *nix geeks [2]:

# cat data | map | sort | reduce

• Mapper reads input data from stdin

• Mapper writes output to stdout

• Reducer receives input, sorted by key, on stdin

• Reducer writes output to stdout

[2] http://bit.ly/michaelnoll
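A streaming mapper and reducer for word count might look like the sketch below. It assumes tab-separated key/value lines, which is Hadoop streaming's default convention; the function names and the `simulate` helper are hypothetical. `simulate` reproduces the `cat data | map | sort | reduce` pipeline in memory so the pair can be tried without a cluster.

```python
import io
import sys
from itertools import groupby

def mapper(stdin, stdout):
    """Read raw lines; write one tab-separated 'word<TAB>1' record per word."""
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    """Read records sorted by key; write one 'word<TAB>total' line per word."""
    records = (line.rstrip("\n").split("\t") for line in stdin)
    for word, group in groupby(records, key=lambda record: record[0]):
        total = sum(int(count) for _, count in group)
        stdout.write(f"{word}\t{total}\n")

def simulate(text):
    """Local equivalent of: cat data | map | sort | reduce."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    sorted_records = io.StringIO("".join(sorted(mapped.getvalue().splitlines(True))))
    reduced = io.StringIO()
    reducer(sorted_records, reduced)
    return reduced.getvalue()

streaming_output = simulate("lazy dog -- the fox ?\nthe quick brown fox\n")
sys.stdout.write(streaming_output)
```

In an actual streaming job the two functions would live in separate scripts, each calling its function with `sys.stdin` and `sys.stdout`. The reducer relies on its input arriving sorted by key, which is why `groupby` is sufficient.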

Page 65: Large-scale social media analysis with Hadoop

wordcount.sh

Locally:

# cat data | tr " " "\n" | sort | uniq -c

Distributed:



Page 67: Large-scale social media analysis with Hadoop

Transparent scaling

Use the same code on MBs locally or TBs across thousands of machines.

Page 68: Large-scale social media analysis with Hadoop

wordcount.py


Page 69: Large-scale social media analysis with Hadoop

Outline

1 Background (5 Ws)

2 Introduction to MapReduce (How, Part I)

3 Applications (How, Part II)


Page 70: Large-scale social media analysis with Hadoop

Network data


Page 71: Large-scale social media analysis with Hadoop

Scale

• Example numbers:
  • ∼ 10^7 nodes
  • ∼ 10^2 edges/node
  • no node/edge data
  • static
  • ∼8GB

[Diagram: network fragment showing User 1, User 2, ...]

Simple, static networks push memory limit for commodity machines
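One way to arrive at a figure of this size is sketched below. The slide does not say how the edges are stored, so the 8 bytes per edge is purely an assumption made for the arithmetic.

```python
NODES = 10**7
EDGES_PER_NODE = 10**2
BYTES_PER_EDGE = 8  # assumption, not stated on the slide

total_edges = NODES * EDGES_PER_NODE
total_gigabytes = total_edges * BYTES_PER_EDGE / 10**9
print(f"{total_edges:.0e} edges, about {total_gigabytes:.0f} GB")
```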


Page 73: Large-scale social media analysis with Hadoop

Scale


Page 74: Large-scale social media analysis with Hadoop

Scale

• Example numbers:
  • ∼ 10^7 nodes
  • ∼ 10^2 edges/node
  • node/edge metadata
  • dynamic
  • ∼100GB/day

[Diagram: User 1 and User 2, each a User record (Profile, History, ...), linked by Message records (Header, Content, ...)]

Dynamic, data-rich social networks exceed memory limits; require considerable storage


Page 76: Large-scale social media analysis with Hadoop

Assumptions

Look only at topology, ignoring node and edge metadata


Page 77: Large-scale social media analysis with Hadoop

Assumptions

Full network exceeds memory of single machine


Page 79: Large-scale social media analysis with Hadoop

Assumptions

First-hop neighborhood of any individual node fits in memory


Page 80: Large-scale social media analysis with Hadoop

Distributed network analysis

MapReduce convenient for parallelizing individual node/edge-level calculations

Page 81: Large-scale social media analysis with Hadoop

Distributed network analysis

Higher-order calculations more difficult, but can be adapted to MapReduce framework

Page 82: Large-scale social media analysis with Hadoop

Distributed network analysis

• Network creation/manipulation
  • Logs → edges
  • Edge list ↔ adjacency list
  • Directed ↔ undirected
  • Edge thresholds

• First-order descriptive statistics
  • Number of nodes
  • Number of edges
  • Node degrees

• Higher-order node-level descriptive statistics
  • Clustering coefficient
  • Implicit degree
  • ...

• Global calculations
  • Pairwise connectivity
  • Connected components
  • Minimum spanning tree
  • Breadth-first search
  • Pagerank
  • Community detection

Page 83: Large-scale social media analysis with Hadoop

Edge list → adjacency list


Page 84: Large-scale social media analysis with Hadoop

Edge list → adjacency list

source target

1 4
1 5
1 6
7 6
7 1
3 1
1 3
4 2
3 2
2 8
10 2
2 9
3 4
4 3

node | in-neighbors | out-neighbors

1 | 3 7 | 3 5 4 6
10 | - | 2
2 | 10 3 4 | 9 8
3 | 1 4 | 1 2 4
4 | 1 3 | 3 2
5 | 1 | -
6 | 1 7 | -
7 | - | 1 6
8 | 2 | -
9 | 2 | -

Page 85: Large-scale social media analysis with Hadoop

Edge list → adjacency list

Map: for each (source, target), output (source, →, target) &(target, ←, source)

source target → node direction neighbor

1 4 → 1 > 4, 4 < 1
1 5 → 1 > 5, 5 < 1
1 6 → 1 > 6, 6 < 1
7 6 → 7 > 6, 6 < 7
7 1 → 7 > 1, 1 < 7
3 1 → 3 > 1, 1 < 3
1 3 → 1 > 3, 3 < 1
...

Page 86: Large-scale social media analysis with Hadoop

Edge list → adjacency list

Shuffle: collect each node’s records

source target

1 4
1 5
1 6
7 6
7 1
3 1
1 3
...

node direction neighbor (collected by node)

1 < 3, 1 < 7, 1 > 3, 1 > 4, 1 > 5, 1 > 6
10 > 2
2 < 10, 2 < 3, 2 < 4, 2 > 8, 2 > 9
3 < 1, 3 < 4, 3 > 1, 3 > 2, 3 > 4
...

Page 87: Large-scale social media analysis with Hadoop

Edge list → adjacency list

Reduce: for each node, concatenate all in- and out-neighbors

node direction neighbor (collected by node)

1 < 3, 1 < 7, 1 > 3, 1 > 4, 1 > 5, 1 > 6
10 > 2
2 < 10, 2 < 3, 2 < 4, 2 > 8, 2 > 9
3 < 1, 3 < 4, 3 > 1, 3 > 2, 3 > 4
...

node | in-neighbors | out-neighbors

1 | 3 7 | 3 5 4 6
10 | - | 2
2 | 10 3 4 | 9 8
3 | 1 4 | 1 2 4
4 | 1 3 | 3 2
5 | 1 | -
6 | 1 7 | -
7 | - | 1 6
8 | 2 | -
9 | 2 | -


Page 89: Large-scale social media analysis with Hadoop

Edge list → adjacency list

Adjacency lists provide access to a node’s local structure, e.g. we can pass messages from a node to its neighbors.
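The three steps above (map, shuffle, reduce) can be sketched in plain Python. This is a minimal local simulation, not Hadoop code: the `shuffle` helper stands in for the framework's grouping step, the function names are hypothetical, and the edge list is reconstructed from the adjacency list shown on the slides.

```python
from collections import defaultdict

# Edge list (source, target), reconstructed from the adjacency list on the slides.
EDGES = [(1, 3), (1, 4), (1, 5), (1, 6), (2, 8), (2, 9), (3, 1),
         (3, 2), (3, 4), (4, 2), (4, 3), (7, 1), (7, 6), (10, 2)]


def map_edge(source, target):
    """Map: emit each edge twice, once keyed by each endpoint."""
    yield source, (">", target)  # target is an out-neighbor of source
    yield target, ("<", source)  # source is an in-neighbor of target


def shuffle(pairs):
    """Shuffle: collect all records that share a key (stand-in for Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_node(records):
    """Reduce: concatenate a node's in- and out-neighbors."""
    in_neighbors = sorted(n for direction, n in records if direction == "<")
    out_neighbors = sorted(n for direction, n in records if direction == ">")
    return in_neighbors, out_neighbors


mapped = [pair for source, target in EDGES for pair in map_edge(source, target)]
adjacency = {node: reduce_node(records) for node, records in shuffle(mapped).items()}

for node in sorted(adjacency):
    print(node, *adjacency[node])
```

Each node's record now holds everything needed to send a message along its out-edges, which the later examples rely on.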

Page 90: Large-scale social media analysis with Hadoop

Degree distribution

node  in-degree  out-degree
1     2          4
10    0          1
2     3          2
3     2          3
4     2          2
5     1          0
6     2          0
7     0          2
8     1          0
9     1          0

Page 91: Large-scale social media analysis with Hadoop

Degree distribution

Map: for each node, output in- and out-degree with count (of 1)

bin    count
in_0   1
in_0   1
in_1   1
in_1   1
in_1   1
in_2   1
in_2   1
in_2   1
in_2   1
in_3   1
out_0  1
out_0  1
out_0  1
out_0  1
out_1  1
out_2  1
out_2  1
out_2  1
out_3  1
out_4  1

Page 92: Large-scale social media analysis with Hadoop

Degree distribution

Shuffle: collect counts for each in/out-degree

Page 93: Large-scale social media analysis with Hadoop

Degree distribution

Reduce: add counts

in/out  degree  count
in      0       2
in      1       3
in      2       4
in      3       1
out     0       4
out     1       1
out     2       3
out     3       1
out     4       1
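A minimal Python sketch of this job, assuming the per-node degrees from the slide's table as input. The shuffle and reduce steps are folded into a single `Counter` here for brevity; the function names are hypothetical.

```python
from collections import Counter

# node -> (in-degree, out-degree), copied from the table on the slides.
DEGREES = {1: (2, 4), 10: (0, 1), 2: (3, 2), 3: (2, 3), 4: (2, 2),
           5: (1, 0), 6: (2, 0), 7: (0, 2), 8: (1, 0), 9: (1, 0)}


def map_degrees(in_degree, out_degree):
    """Map: emit one count for the node's in-degree bin and one for its out-degree bin."""
    yield f"in_{in_degree}", 1
    yield f"out_{out_degree}", 1


def reduce_counts(pairs):
    """Shuffle + reduce: add up the counts emitted for each bin."""
    totals = Counter()
    for bin_key, count in pairs:
        totals[bin_key] += count
    return totals


pairs = [pair for in_deg, out_deg in DEGREES.values()
         for pair in map_degrees(in_deg, out_deg)]
distribution = reduce_counts(pairs)

for bin_key in sorted(distribution):
    print(bin_key, distribution[bin_key])
```

Because the reduce step is a plain sum, the same function could also serve as a combiner to cut down intermediate data.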

Page 94: Large-scale social media analysis with Hadoop

Clustering coefficient

Fraction of edges amongst a node’s in/out-neighbors, e.g. how many of a node’s friends are following each other?

Page 95: Large-scale social media analysis with Hadoop

Clustering coefficient

[Figure: count (log scale, 10^0 to 10^8) versus clustering coefficient (0 to 0.8), plotted for followers and friends]

Page 96: Large-scale social media analysis with Hadoop

Clustering coefficient

[Figure: example network with nodes 1 through 10]

Page 99: Large-scale social media analysis with Hadoop

Clustering coefficient

Map: pass all of a node’s out-neighbors to each of its in-neighbors

node  out-neighbor  two-hop neighbors
3     1             3 5 4 6
7     1             3 5 4 6
10    2             9 8
3     2             9 8
4     2             9 8
1     3             1 2 4
4     3             1 2 4
1     4             3 2
3     4             3 2
1     5             -
1     6             -
7     6             -
2     8             -
2     9             -

Page 100: Large-scale social media analysis with Hadoop

Clustering coefficient

Shuffle: collect each node’s two-hop neighborhoods

node  out-neighbor  two-hop neighbors
1     3             1 2 4
1     4             3 2
1     5             -
1     6             -
10    2             9 8
2     8             -
2     9             -
3     1             3 5 4 6
3     2             9 8
3     4             3 2
4     2             9 8
4     3             1 2 4
7     1             3 5 4 6
7     6             -

Page 101: Large-scale social media analysis with Hadoop

Clustering coefficient

Reduce: count a half-triangle for each node reachable by both a one- and two-hop path

node  triangles
1     1.0
10    0.0
2     0.0
3     1.0
4     0.5
7     0.5

Page 102: Large-scale social media analysis with Hadoop

Clustering coefficient

Note: this approach generates a large amount of intermediate data relative to the final output.
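The triangle-counting rounds can be sketched locally in Python as below. This is an illustration under assumptions: the adjacency list is the one from the slides, the function names are hypothetical, and the sketch stops at the half-triangle counts because the slides do not show how the counts are normalized into a coefficient.

```python
from collections import defaultdict

# node -> (in-neighbors, out-neighbors), from the adjacency list on the slides.
ADJ = {1: ([3, 7], [3, 5, 4, 6]), 10: ([], [2]), 2: ([10, 3, 4], [9, 8]),
       3: ([1, 4], [1, 2, 4]), 4: ([1, 3], [3, 2]), 5: ([1], []),
       6: ([1, 7], []), 7: ([], [1, 6]), 8: ([2], []), 9: ([2], [])}


def map_node(node, in_neighbors, out_neighbors):
    """Map: pass all of this node's out-neighbors to each of its in-neighbors."""
    for source in in_neighbors:
        # Key: the in-neighbor. Value: (its out-neighbor, two-hop neighbors via it).
        yield source, (node, tuple(out_neighbors))


def reduce_node(records):
    """Reduce: add 0.5 for each node reachable by both a one- and a two-hop path."""
    one_hop = {neighbor for neighbor, _ in records}
    total = 0.0
    for _, two_hop in records:
        total += 0.5 * sum(1 for n in two_hop if n in one_hop)
    return total


# Shuffle: group the mapper output by key.
groups = defaultdict(list)
for node, (ins, outs) in ADJ.items():
    for key, value in map_node(node, ins, outs):
        groups[key].append(value)

triangles = {node: reduce_node(records) for node, records in groups.items()}
print(triangles)
```

The mapper emits one copy of a node's out-neighbor list per in-neighbor, which is the source of the large intermediate data noted above.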

Page 103: Large-scale social media analysis with Hadoop

Breadth first search

Iterative approach: each MapReduce round expands the frontier

[Figure: example network; the source node is labeled 0 and the other nine nodes are labeled ?]

Map: If node’s distance d to source is finite, output neighbor’s distance as d+1
Reduce: Set node’s distance to minimum received from all in-neighbors

Page 104: Large-scale social media analysis with Hadoop

Breadth first search

Iterative approach: each MapReduce round expands the frontier

[Figure: after the first round, four nodes are at distance 1; five remain unlabeled (?)]

Page 105: Large-scale social media analysis with Hadoop

Breadth first search

Iterative approach: each MapReduce round expands the frontier

[Figure: after the second round, one more node is at distance 2; four remain unlabeled (?)]

Page 106: Large-scale social media analysis with Hadoop

Breadth first search

Iterative approach: each MapReduce round expands the frontier

[Figure: after the third round, two more nodes are at distance 3; two remain unlabeled (?)]
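A minimal Python sketch of the iterative scheme, run locally on the example graph from the earlier slides with node 1 as the source. Two details are assumptions not stated on the slides: each node also passes its own current distance through to the reducer (so known distances are not lost), and iteration stops when a round changes nothing.

```python
import math
from collections import defaultdict

# node -> out-neighbors, from the adjacency list on the earlier slides.
OUT = {1: [3, 4, 5, 6], 2: [8, 9], 3: [1, 2, 4], 4: [2, 3], 5: [],
       6: [], 7: [1, 6], 8: [], 9: [], 10: [2]}


def bfs_round(dist):
    """One MapReduce round: map sends d+1 to out-neighbors, reduce takes the minimum."""
    received = defaultdict(list)
    for node, d in dist.items():
        received[node].append(d)  # assumption: keep the node's current distance
        if math.isfinite(d):
            for neighbor in OUT[node]:
                received[neighbor].append(d + 1)
    return {node: min(candidates) for node, candidates in received.items()}


def bfs(source):
    """Repeat rounds until no distance changes; return distances and rounds used."""
    dist = {node: math.inf for node in OUT}
    dist[source] = 0
    rounds = 0
    while True:
        updated = bfs_round(dist)
        if updated == dist:
            return dist, rounds
        dist, rounds = updated, rounds + 1


distances, rounds = bfs(1)
print(distances, rounds)
```

Nodes that are never reached keep an infinite distance, matching the nodes still labeled ? in the final figure.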

Page 107: Large-scale social media analysis with Hadoop

Break complicated tasks into multiple, simpler MapReduce rounds.

Page 108: Large-scale social media analysis with Hadoop

Pagerank

Iterative approach: each MapReduce round broadcasts and collects edge messages for power method [3]

[Figure: example network with nodes labeled by their pagerank values q1 through q10]

Map: Output current pagerank over degree to each out-neighbor
Reduce: Sum incoming probabilities to update estimate

http://bit.ly/nielsenpagerank

[3] Extra rounds for random jump, dangling nodes
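One power-method round can be sketched in Python as follows. This is a local illustration, not the tutorial's implementation: the random jump and dangling-node mass are folded into the same function rather than run as extra rounds, and the damping value of 0.85 and the fixed round count are assumed, commonly used choices.

```python
from collections import defaultdict

# node -> out-neighbors, from the adjacency list on the earlier slides.
OUT = {1: [3, 4, 5, 6], 2: [8, 9], 3: [1, 2, 4], 4: [2, 3], 5: [],
       6: [], 7: [1, 6], 8: [], 9: [], 10: [2]}


def pagerank_round(rank, damping=0.85):
    """Map: send rank / out-degree along each out-edge. Reduce: sum what arrives."""
    n = len(rank)
    incoming = defaultdict(float)
    dangling = 0.0
    for node, q in rank.items():
        if OUT[node]:
            share = q / len(OUT[node])
            for neighbor in OUT[node]:
                incoming[neighbor] += share
        else:
            dangling += q  # assumption: spread dangling mass uniformly
    return {node: (1 - damping) / n + damping * (incoming[node] + dangling / n)
            for node in rank}


def pagerank(rounds=50):
    """Start from the uniform distribution and apply a fixed number of rounds."""
    rank = {node: 1 / len(OUT) for node in OUT}
    for _ in range(rounds):
        rank = pagerank_round(rank)
    return rank


rank = pagerank()
for node in sorted(rank, key=rank.get, reverse=True):
    print(node, round(rank[node], 4))
```

Each round preserves the total probability, so the estimates can be read as a distribution over nodes.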

Page 109: Large-scale social media analysis with Hadoop

Machine learning

Page 110: Large-scale social media analysis with Hadoop

Machine learning

• Often use MapReduce for feature extraction, then fit/optimize locally

• Useful for “embarrassingly parallel” parts of learning, e.g.
    • parameter sweeps for cross-validation
    • independent restarts for local optimization
    • making predictions on independent examples

• Remember: MapReduce isn’t always the answer

Page 111: Large-scale social media analysis with Hadoop

Classification

Example: given words in an article, assign article to one of K classes

“Floyd Landis showed up at the Tour of California”

...

Page 114: Large-scale social media analysis with Hadoop

Classification: naive Bayes

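• Model presence/absence of each word as independent coin flip

\[ p(\text{word} \mid \text{class}) = \mathrm{Bernoulli}(\theta_{wc}) \]

\[ p(\text{words} \mid \text{class}) = p(\text{word}_1 \mid \text{class}) \, p(\text{word}_2 \mid \text{class}) \cdots \]

• Maximum likelihood estimates of probabilities from word and class counts

\[ \hat{\theta}_{wc} = \frac{N_{wc}}{N_c}, \qquad \hat{\theta}_c = \frac{N_c}{N} \]

• Use Bayes’ rule to calculate distribution over classes given words

\[ p(\text{class} \mid \text{words}, \Theta) = \frac{p(\text{words} \mid \text{class}, \Theta) \, p(\text{class}, \Theta)}{p(\text{words}, \Theta)} \]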
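A small Python sketch of these estimates and the resulting posterior. The toy documents and function names are hypothetical. One deliberate departure from the slide: a pseudo-count `alpha` is added to the word estimates (add-one smoothing by default) so that unseen words do not produce log(0); setting `alpha=0` recovers the maximum likelihood estimates above.

```python
import math


def estimate(docs, alpha=1.0):
    """Estimate class priors N_c / N and word probabilities from counts."""
    n_c, n_wc, vocab = {}, {}, set()
    for label, words in docs:
        n_c[label] = n_c.get(label, 0) + 1
        for word in set(words):  # presence/absence, so count each word once
            n_wc[(word, label)] = n_wc.get((word, label), 0) + 1
            vocab.add(word)
    priors = {c: n_c[c] / len(docs) for c in n_c}
    # Smoothed version of N_wc / N_c (alpha is an assumption, not on the slide).
    theta = {(w, c): (n_wc.get((w, c), 0) + alpha) / (n_c[c] + 2 * alpha)
             for w in vocab for c in n_c}
    return priors, theta, vocab


def posterior(words, priors, theta, vocab):
    """Bayes' rule: p(class | words) is proportional to p(words | class) p(class)."""
    present = set(words) & vocab
    log_joint = {}
    for c, prior in priors.items():
        total = math.log(prior)
        for w in vocab:
            p = theta[(w, c)]
            total += math.log(p if w in present else 1 - p)
        log_joint[c] = total
    shift = max(log_joint.values())  # subtract the max for numerical stability
    weights = {c: math.exp(v - shift) for c, v in log_joint.items()}
    norm = sum(weights.values())
    return {c: weight / norm for c, weight in weights.items()}


# Hypothetical toy training set: (class, words present in the article).
DOCS = [("sports", {"team", "game", "season"}), ("sports", {"game", "race"}),
        ("world", {"government", "united"}), ("world", {"government", "euro"})]

priors, theta, vocab = estimate(DOCS)
result = posterior({"game", "team"}, priors, theta, vocab)
print(result)
```

Working in log space avoids underflow when the vocabulary is large, since the likelihood is a product over every word.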

Page 117: Large-scale social media analysis with Hadoop

Classification: naive Bayes

Naive ↔ independent features

Class-conditional word count

Page 119: Large-scale social media analysis with Hadoop

Classification: naive Bayes

class   words
world   Economics Is on Agenda for U.S. Meetings in China
world   U.K. Backs Germany's Effort to Support Euro
sports  A Pitchers’ Duel Ends in Favor of the Yankees
sports  After Doping Allegations, a Race for Details

class   word        count
sports  an          355
sports  be          317
sports  first       318
sports  game        379
sports  has         374
sports  have        284
sports  one         296
sports  said        325
sports  season      295
sports  team        279
sports  their       334
sports  this        293
sports  when        290
sports  who         363
world   after       347
world   but         299
world   government  300
world   had         352
world   have        342
world   he          355
world   its         308
world   mr          293
world   united      313
world   were        319
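The class-conditional word count is a word count keyed by (class, word). Below is a minimal Python sketch using the four example headlines from this slide as input; the tokenization (lowercasing, stripping trailing periods and commas) is a crude assumption for illustration, and the function names are hypothetical.

```python
from collections import Counter

# (class, headline) pairs taken from the slide.
DOCS = [
    ("world", "Economics Is on Agenda for U.S. Meetings in China"),
    ("world", "U.K. Backs Germany's Effort to Support Euro"),
    ("sports", "A Pitchers' Duel Ends in Favor of the Yankees"),
    ("sports", "After Doping Allegations, a Race for Details"),
]


def map_doc(label, text):
    """Map: emit ((class, word), 1) once per distinct word in the document."""
    words = {token.strip(".,").lower() for token in text.split()}
    for word in sorted(words):
        yield (label, word), 1


def reduce_counts(pairs):
    """Shuffle + reduce: sum the counts for each (class, word) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


pairs = [pair for label, text in DOCS for pair in map_doc(label, text)]
counts = reduce_counts(pairs)

for (label, word), count in sorted(counts.items()):
    print(label, word, count)
```

Counting each word once per document matches the presence/absence model; counting every occurrence instead would suit a multinomial variant.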

Page 120: Large-scale social media analysis with Hadoop

Clustering:

Find clusters of “similar” points

http://bit.ly/oldfaithful

Page 121: Large-scale social media analysis with Hadoop

Clustering: K-means

Map: Assign each point to cluster with closest mean [4], output (cluster, features)
Reduce: Update clusters by calculating new class-conditional means

http://en.wikipedia.org/wiki/K-means_clustering

[4] Each mapper loads all cluster centers on init
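One K-means round, and the loop around it, can be sketched in Python as below. This is a local illustration with hypothetical toy points and function names; keeping the old center for a cluster that receives no points is an assumption the slides do not address.

```python
from collections import defaultdict


def closest(point, centers):
    """Index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))


def kmeans_round(points, centers):
    """Map: assign each point to its closest center. Reduce: average each cluster."""
    clusters = defaultdict(list)
    for point in points:
        clusters[closest(point, centers)].append(point)
    updated = list(centers)  # assumption: empty clusters keep their old center
    for k, members in clusters.items():
        updated[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans(points, centers, max_rounds=100):
    """Repeat rounds until the centers stop moving or max_rounds is reached."""
    for _ in range(max_rounds):
        updated = kmeans_round(points, centers)
        if updated == centers:
            break
        centers = updated
    return centers


# Hypothetical toy data: two well-separated groups of 2-D points.
POINTS = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centers = kmeans(POINTS, [(0.0, 0.0), (10.0, 10.0)])
print(final_centers)
```

In a real job the reducer would emit per-cluster sums and counts so that a combiner can pre-aggregate before the shuffle.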

Page 122: Large-scale social media analysis with Hadoop

Mahout

http://mahout.apache.org/

Page 123: Large-scale social media analysis with Hadoop

Thanks

• Sharad Goel [y]

• Winter Mason [y]

• Sid Suri [y]

• Sergei Vassilvitskii [y]

• Duncan Watts [y]

• Eytan Bakshy [m, y]

[y] Yahoo! Research (http://research.yahoo.com)
[m] University of Michigan

Page 124: Large-scale social media analysis with Hadoop

Thanks.

Questions? [5]

[5] http://jakehofman.com/icwsm2010