2013-06-03 Berlin Buzzwords


Scaling Apache Giraph

Nitay Joffe, Data Infrastructure Engineer

nitay@apache.org

June 3, 2013

Agenda

1 Background

2 Scaling

3 Results

4 Questions

Background

What is Giraph?
• Apache open source graph computation engine based on Google’s Pregel.
• Support for Hadoop, Hive, HBase, and Accumulo.
• BSP model with a simple “think like a vertex” API.
• Combiners, Aggregators, Mutability, and more.
• Configurable Graph<I,V,E,M>:

– I: Vertex ID

– V: Vertex Value

– E: Edge Value

– M: Message data

Each of I, V, E, and M implements Hadoop’s Writable.

What is Giraph NOT?
• A graph database. See Neo4j.
• A completely asynchronous generic MPI system.
• A slow tool.
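Simple cases just reuse LongWritable, DoubleWritable, FloatWritable, and NullWritable, but a custom type only needs write/readFields. A minimal sketch of a message type usable as the M parameter (the PageRankMessage name is illustrative, not from the deck):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical message type for the M slot of Graph<I, V, E, M>.
public class PageRankMessage implements Writable {
  private long sourceId;    // vertex that sent the message
  private double rankShare; // fraction of page rank being sent

  public PageRankMessage() {}  // deserialization via reflection needs a no-arg constructor

  public PageRankMessage(long sourceId, double rankShare) {
    this.sourceId = sourceId;
    this.rankShare = rankShare;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(sourceId);
    out.writeDouble(rankShare);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    sourceId = in.readLong();
    rankShare = in.readDouble();
  }

  public double getRankShare() { return rankShare; }
}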

Why not Hive?

[Figure: the Hive/MapReduce dataflow. Input 0/1 → InputFormat → map tasks → intermediate files → reduce tasks → OutputFormat → Output 0/1, then iterate by running the whole pipeline again.]

• Too much disk. Limited in-memory caching.
• Each iteration becomes a MapReduce job!

Giraph components

Master – Application coordinator

• Synchronizes supersteps

• Assigns partitions to workers before superstep begins

Workers – Computation & messaging

• Handle I/O – reading and writing the graph

• Computation/messaging of assigned partitions

ZooKeeper

• Maintains global application state

Giraph Dataflow

[Figure: the Giraph dataflow in three phases.
1. Loading the graph: workers read input splits via the InputFormat and load/send graph partitions (Part 0-3) as assigned by the master.
2. Compute/Iterate: each worker computes its assigned partitions on the in-memory graph and sends messages, then sends stats to the master and iterates.
3. Storing the graph: workers write their partitions via the OutputFormat.]

Giraph Job Lifetime

[Figure: job and vertex lifecycle. Input → compute superstep → "All vertices halted?" (No: next superstep; Yes: "Master halted?" No: next superstep, Yes: output). Vertex lifecycle: a vertex is Active until it votes to halt, becoming Inactive; a received message makes it Active again.]

Simple Example – Compute the maximum value

[Figure: vertices spread across two processors exchange their values superstep by superstep until every vertex holds the maximum value, 5.]
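The per-vertex logic for this example is a few lines. A minimal sketch against the Giraph-1.0-era "think like a vertex" API described in these slides; the package and method names are from memory and may differ in other versions:

// Sketch against the Giraph-1.0-era Vertex API; verify class/method names for your version.
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class MaxValueVertex extends
    Vertex<LongWritable, LongWritable, NullWritable, LongWritable> {

  @Override
  public void compute(Iterable<LongWritable> messages) {
    long max = getValue().get();
    boolean changed = false;
    for (LongWritable message : messages) {
      if (message.get() > max) {
        max = message.get();
        changed = true;
      }
    }
    // Every vertex announces its value once; afterwards only improvements propagate.
    if (getSuperstep() == 0 || changed) {
      setValue(new LongWritable(max));
      sendMessageToAllEdges(new LongWritable(max));
    }
    voteToHalt();  // a later incoming message re-activates the vertex
  }
}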

Connected Components, e.g. Finding Communities

PageRank – ranking websites

Mahout (Hadoop): 854 lines
Giraph: < 30 lines

• Send neighbors an equal fraction of your page rank.
• New page rank = 0.15 / (# of vertices) + 0.85 * (sum of messages)
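The two bullets above translate almost directly into a compute method. A hedged sketch in the same Giraph-1.0-era API as the max-value example (names from memory; the 30-superstep cutoff is an arbitrary choice for illustration):

import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

public class SimplePageRankVertex extends
    Vertex<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

  private static final int MAX_SUPERSTEPS = 30;  // arbitrary cutoff for the sketch

  @Override
  public void compute(Iterable<DoubleWritable> messages) {
    if (getSuperstep() > 0) {
      double sum = 0;
      for (DoubleWritable message : messages) {
        sum += message.get();
      }
      // New page rank = 0.15 / (# of vertices) + 0.85 * (sum of messages).
      setValue(new DoubleWritable(0.15d / getTotalNumVertices() + 0.85d * sum));
    }
    if (getSuperstep() < MAX_SUPERSTEPS) {
      if (getNumEdges() > 0) {
        // Send neighbors an equal fraction of this vertex's page rank.
        sendMessageToAllEdges(
            new DoubleWritable(getValue().get() / getNumEdges()));
      }
    } else {
      voteToHalt();
    }
  }
}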

Scaling

Problem: Worker Crash.

[Figure: execution timeline with checkpointing. Supersteps i (no checkpoint), i+1 (checkpoint), i+2 (no checkpoint): worker failure! Execution resumes from the i+1 checkpoint: i+1 (checkpoint), i+2 (no checkpoint), i+3 (checkpoint): worker failure after the checkpoint completes! Execution resumes at i+3 (no checkpoint), and the application completes.]

Solution: Checkpointing.
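Checkpoint frequency is ordinary job configuration. A hedged sketch using Hadoop's Configuration; the giraph.checkpointFrequency property name is recalled from Giraph 1.x and should be verified against the version in use:

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfig {
  public static void configure(Configuration conf) {
    // Write a checkpoint every 4 supersteps; on worker failure the job
    // restarts from the most recent checkpoint instead of from scratch.
    // Property name assumed from Giraph 1.x -- verify for your version.
    conf.setInt("giraph.checkpointFrequency", 4);
  }
}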

Problem: Master Crash.

Solution: ZooKeeper Master Queue.

[Figure: before failure, Master 0 is "Active" while Masters 1 and 2 are "Spare", with the active-master state kept in ZooKeeper. After the failure of active Master 0, Master 1 becomes "Active" and Master 2 remains "Spare".]

Problem: Primitive Collections.
• Graphs often parameterized with {Null, Int, Long, Float, Double}.
• Boxing/unboxing. Objects have internal overhead.


Solution: Use fastutil, e.g. Long2DoubleOpenHashMap.

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion
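For example, a per-vertex map from neighbor ID to edge weight can store raw longs and doubles instead of boxed objects. The class and method names below are real fastutil API; the use case is illustrative:

import it.unimi.dsi.fastutil.longs.Long2DoubleMap;
import it.unimi.dsi.fastutil.longs.Long2DoubleOpenHashMap;

public class PrimitiveEdgeWeights {
  public static void main(String[] args) {
    // HashMap<Long, Double> would allocate a Long and a Double per entry;
    // this map keeps keys and values in plain long[] / double[] arrays.
    Long2DoubleOpenHashMap weights = new Long2DoubleOpenHashMap();
    weights.defaultReturnValue(0.0);

    weights.put(42L, 1.7);        // no boxing on insert
    weights.put(7L, 0.4);

    double w = weights.get(42L);  // no unboxing on lookup
    System.out.println("weight(42) = " + w);

    // Entry views expose primitive accessors (no boxing of key or value):
    for (Long2DoubleMap.Entry e : weights.long2DoubleEntrySet()) {
      System.out.println(e.getLongKey() + " -> " + e.getDoubleValue());
    }
  }
}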


Single Source Shortest Path


Network Flow


Count In-Degree

Problem: Too many objects. Lots of time spent in GC.

Graph: 1B Vertices, 200B Edges, 200 Workers.

• 1B Edges per Worker. 1 object per edge value. List<Edge<I, E>> ~ 10B objects.
• 5M Vertices per Worker. 10 objects per vertex value. Map<I, Vertex<I, V, E>> ~ 50M objects.
• 1 Message per Edge. 10 objects per message data. Map<I, List<M>> ~ 10B objects.
• Objects used ~= O(E*e + V*v + M*m) => O(E*e)

Label Propagation, e.g. Who’s sleeping?

[Figure: an example graph where vertices labeled Boring, Amazing, and Confusing propagate weighted labels (e.g. 0.5, 0.2, 0.8) to infer the label of an unlabeled vertex: "Q: What did he think?"]

Problem: Too many objects. Lots of time spent in GC.

Solution: byte[]
• Serialize messages, edges, and vertices.
• Iterable interface with representative object.

[Figure: each next() call deserializes the next item from the serialized input into a single reusable representative object.]

Objects per worker ~= O(V)
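A hedged sketch of the representative-object idea, independent of Giraph's internal classes: values stay serialized in one byte[] and next() deserializes into a single reused Writable rather than allocating an object per element. It uses Hadoop's DataOutputBuffer/DataInputBuffer; the class name is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;

// Illustrative only: stores LongWritable values serialized in a byte[]
// and iterates them with one reusable representative object.
public class SerializedLongIterable implements Iterable<LongWritable> {
  private final byte[] data;
  private final int length;
  private final int count;

  public SerializedLongIterable(long[] values) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    for (long v : values) {
      new LongWritable(v).write(out);   // serialize once, up front
    }
    this.data = out.getData();
    this.length = out.getLength();
    this.count = values.length;
  }

  @Override
  public Iterator<LongWritable> iterator() {
    final DataInputBuffer in = new DataInputBuffer();
    in.reset(data, length);
    final LongWritable representative = new LongWritable();  // reused on every call
    return new Iterator<LongWritable>() {
      private int seen = 0;

      @Override public boolean hasNext() { return seen < count; }

      @Override public LongWritable next() {
        try {
          representative.readFields(in);  // overwrite, don't allocate
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
        seen++;
        return representative;            // caller must not hold onto it
      }

      @Override public void remove() { throw new UnsupportedOperationException(); }
    };
  }
}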


Problem: Serialization of byte[]
• DataInput? Kryo? Custom?

Solution: Unsafe
• Dangerous. No formal API. Volatile. Non-portable (Oracle JVM only).
• AWESOME. As fast as it gets.
• True native. Essentially C: *(long*)(data+offset);
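A minimal sketch of the "essentially C" read: obtain sun.misc.Unsafe reflectively and read a long straight out of a byte[]. This is the unsupported, HotSpot-only API the bullets above warn about.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeLongReader {
  private static final Unsafe UNSAFE = getUnsafe();

  // Grab the singleton via reflection; Unsafe.getUnsafe() rejects normal callers.
  private static Unsafe getUnsafe() {
    try {
      Field f = Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Unsafe unavailable", e);
    }
  }

  // Roughly *(long*)(data + offset): read 8 bytes directly out of the array.
  public static long readLong(byte[] data, int offset) {
    return UNSAFE.getLong(data, (long) Unsafe.ARRAY_BYTE_BASE_OFFSET + offset);
  }

  public static void main(String[] args) {
    byte[] data = new byte[16];
    data[8] = 1;
    // Result depends on the platform's native byte order.
    System.out.println(readLong(data, 8));
  }
}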

Problem: Large Aggregations.

[Figure: with a single master aggregating, every worker exchanges all aggregator values directly with the Master.]

Solution: Sharded Aggregators.
• Workers own aggregators.
• Aggregator owners communicate with the Master.
• Aggregator owners distribute values to the other workers.
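The core of the sharding idea is deterministic ownership: every worker can compute which worker owns a given aggregator and route partial values there, so no single node handles every aggregator. A hypothetical sketch of that routing decision, not Giraph's actual code:

// Hypothetical illustration of sharded-aggregator ownership:
// partial values flow worker -> owner -> master -> owner -> all workers.
public class AggregatorSharding {

  /** Deterministically pick the worker that owns an aggregator. */
  public static int ownerOf(String aggregatorName, int numWorkers) {
    return Math.abs(aggregatorName.hashCode() % numWorkers);
  }

  public static void main(String[] args) {
    int numWorkers = 5;
    String[] aggregators = {"pagerank.sum", "kmeans.centroid.0", "kmeans.centroid.1"};
    for (String name : aggregators) {
      System.out.println(name + " owned by worker " + ownerOf(name, numWorkers));
    }
  }
}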


K-Means Clustering, e.g. Similar Emails

Problem: Network Wait.
• RPC doesn’t fit the model.
• Synchronous calls are no good.

Solution: Netty. Tune queue sizes & threads.

[Figure: superstep timelines between barriers. Before: compute, then network, then wait until the end-of-superstep barrier. After tuning: network transfer overlaps compute from the time of the first message, shrinking the wait before the barrier.]
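Thread counts and request sizes are plain job configuration. A sketch; the property names are recalled from Giraph 1.x-era configuration and should be checked against the version in use:

import org.apache.hadoop.conf.Configuration;

public class NettyTuning {
  // Property names assumed from Giraph 1.x -- verify before use.
  public static void tune(Configuration conf) {
    conf.setInt("giraph.nettyClientThreads", 8);   // threads sending requests
    conf.setInt("giraph.nettyServerThreads", 16);  // threads handling received requests
    // Request/buffer ("queue") sizes are tuned through similar giraph.* properties.
  }
}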

Results

Scalability Graphs

[Plot: Increasing Workers. 2B vertices, 200B edges, 20 compute threads; iteration time (sec) vs. number of workers (50 to 300).]

[Plot: Increasing Data Size. 50 workers, 20 compute threads; iteration time (sec) vs. number of edges.]

Lessons Learned

• Coordinating is a zoo. Be resilient with ZooKeeper.
• Efficient networking is hard. Let Netty help.
• Primitive collections, primitive performance. Use fastutil.
• byte[] is simple yet powerful.
• Being Unsafe can be a good thing.

• Have a graph? Use Giraph.

What’s the final result?

Comparison with Hive:
• 20x CPU speedup
• 100x elapsed-time speedup: 15 hours => 9 minutes

Computations on the entire Facebook graph are no longer “weekend jobs”. Now they’re coffee breaks.

Questions?

Problem: Measurements.

• Need tools to gain visibility into the system.
• Problems with connecting to Hadoop sub-processes.

Solution: Do it all.
• YourKit – see YourKitProfiler
• jmap – see JMapHistoDumper
• VisualVM – with jstatd & an SSH SOCKS proxy
• Yammer Metrics
• Hadoop Counters
• Logging & GC prints

Problem: Mutations
• Synchronization.
• Load balancing.

Solution: Reshuffle resources
• Mutations handled at the barrier between supersteps.
• Master rebalances vertex assignments to optimize distribution.
• Handle mutations in batches.
• Avoid if using byte[].
• Favor algorithms which don’t mutate the graph.
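A hypothetical sketch of the batching idea, not Giraph's implementation: compute threads only enqueue mutation requests during the superstep, and the whole batch is applied at the barrier, when no compute threads are touching the partition.

import java.util.AbstractMap;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration of batching graph mutations until the superstep barrier.
public class MutationBuffer<I, V> {
  private final Queue<I> vertexRemovals = new ConcurrentLinkedQueue<I>();
  private final Queue<Map.Entry<I, V>> vertexAdditions =
      new ConcurrentLinkedQueue<Map.Entry<I, V>>();

  // Called freely from compute threads during the superstep: just enqueue.
  public void requestRemoveVertex(I id) {
    vertexRemovals.add(id);
  }

  public void requestAddVertex(I id, V value) {
    vertexAdditions.add(new AbstractMap.SimpleEntry<I, V>(id, value));
  }

  // Called once between supersteps, while no compute threads run:
  // the batch is applied with no per-vertex synchronization needed.
  public void applyAtBarrier(Map<I, V> partition) {
    for (I id : vertexRemovals) {
      partition.remove(id);
    }
    for (Map.Entry<I, V> e : vertexAdditions) {
      partition.put(e.getKey(), e.getValue());
    }
    vertexRemovals.clear();
    vertexAdditions.clear();
  }
}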