GraphLab Tutorial


Carnegie Mellon University

GraphLab Tutorial
Yucheng Low


GraphLab Team

Yucheng Low
Aapo Kyrola
Jay Gu
Joseph Gonzalez
Danny Bickson
Carlos Guestrin

Development History

GraphLab 0.5 (2010): Internal Experimental Code

Insanely Templatized

GraphLab 1 (2011)

Nearly Everything is Templatized

First Open Source Release (LGPL before June 2011; APL from June 2011 on)

GraphLab 2 (2012)

Many Things are Templatized

Shared Memory: Jan 2012
Distributed: May 2012

GraphLab 2 Technical Design Goals
• Improved usability
• Decreased compile time
• As good or better performance than GraphLab 1
• Improved distributed scalability

… other abstraction changes … (come to the talk!)

Development History

Ever since GraphLab 1.0, all active development has been open source (APL):

code.google.com/p/graphlabapi/

(Even current experimental code, activated with a --experimental flag on ./configure.)

Guaranteed Target Platforms
• Any x86 Linux system with gcc >= 4.2
• Any x86 Mac system with gcc 4.2.1 (OS X 10.5 ??)

• Other platforms?

… We welcome contributors.

Tutorial Outline
• GraphLab in a few slides + PageRank
• Checking out GraphLab v2
• Implementing PageRank in GraphLab v2
• Overview of the different GraphLab schedulers
• Preview of Distributed GraphLab v2 (may not work in your checkout!)
• Ongoing work… (however much time allows)

Warning

A preview of code still in intensive development!

Things may or may not work for you!

Interface may still change!

GraphLab 2 still has a number of performance regressions relative to GraphLab 1 that we are ironing out.

PageRank Example

Iterate:

R[i] = α + (1 - α) Σ_{j ∈ N[i]} W_ji R[j]

Where:
α is the random reset probability
L[j] is the number of links on page j (so W_ji = 1/L[j])

[Figure: a small six-page link graph]

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model


Data Graph

A graph with arbitrary data (C++ objects) associated with each vertex and edge

Vertex Data:
• Webpage
• Webpage Features

Edge Data:
• Link weight

Graph:
• Link graph


The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Update Functions

An update function is a user-defined program which, when applied to a vertex, transforms the data in the scope of the vertex.

pagerank(i, scope) {
  // Get neighborhood data (R[i], W_ji, R[j]) from scope

  // Update the vertex data
  R[i] = α + (1 - α) * Σ_{j ∈ N[i]} W_ji R[j];

  // Reschedule neighbors if needed
  if R[i] changes then reschedule_neighbors_of(i);
}


Dynamic Schedule

[Figure: CPU 1 and CPU 2 pop scheduled vertices (a, h, b, i, …) from a shared scheduler queue over a graph with vertices a–k]

The process repeats until the scheduler is empty.

Source Code Interjection 1

Graph, update functions, and schedulers

--scope=vertex
--scope=edge

Consistency

Trade-off: Consistency vs. "Throughput"

(# of "iterations" per second)

Goal of ML algorithm: Converge

This is a false trade-off.


Ensuring Race-Free Code

How much can computation overlap?


The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model

Importance of Consistency

Fast ML algorithm development cycle:

Build → Test → Debug → Tweak Model

Consistency is necessary for the framework to behave predictably and to avoid problems caused by non-determinism. Otherwise: is the execution wrong, or is the model wrong?


Full Consistency

Guaranteed safety for all update functions

Parallel updates are only allowed on vertices that are at least two apart, which reduces the opportunities for parallelism.

Obtaining More Parallelism

Not all update functions will modify the entire scope!

Belief Propagation: only uses edge data
Gibbs Sampling: only needs to read adjacent vertices

Edge Consistency

Obtaining More Parallelism

“Map” operations, e.g. feature extraction on vertex data

Vertex Consistency

The GraphLab Framework: Graph-Based Data Representation, Update Functions (User Computation), Scheduler, Consistency Model


Shared Variables

Global aggregation through the Sync Operation:
• A global parallel reduction over the graph data
• Synced variables are recomputed at defined intervals while update functions are running

Sync: HighestPageRank

Sync: Loglikelihood


Source Code Interjection 2

Shared variables

What can we do with these primitives?

…many many things…

Matrix Factorization

Netflix Collaborative Filtering

Alternating Least Squares Matrix Factorization

Model: 0.5 million nodes, 99 million edges

[Figure: Netflix ratings as a bipartite graph of Users × Movies, factored with dimension d]

Netflix Speedup: increasing size of the matrix factorization

Video Co-Segmentation

Discover "coherent" segment types across a video (extends Batra et al. '10)

1. Form super-voxels from the video
2. EM & inference in a Markov random field

Large model: 23 million nodes, 390 million edges

[Figure: speedup curve, GraphLab vs. ideal]

Many More
• Tensor Factorization
• Bayesian Matrix Factorization
• Graphical Model Inference/Learning
• Linear SVM
• EM clustering
• Linear Solvers using GaBP
• SVD
• Etc.

Distributed Preview

GraphLab 2 Abstraction Changes

(an overview of a couple of them; come to the talk for the rest!)

Exploiting Update Functors

(for the greater good)


1. Update functors store state.
2. The scheduler schedules update functor instances.
3. We can use update functors as controlled asynchronous message passing to communicate between vertices!

Delta-Based Update Functors

struct pagerank : public iupdate_functor<graph, pagerank> {
  double delta;
  pagerank(double d) : delta(d) { }
  void operator+=(pagerank& other) { delta += other.delta; }
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    vdata.rank += delta;
    if (abs(delta) > EPSILON) {
      double out_delta = delta * (1 - RESET_PROB) /
                         context.num_out_edges();
      context.schedule_out_neighbors(pagerank(out_delta));
    }
  }
};
// Initial Rank:     R[i] = 0;
// Initial Schedule: pagerank(RESET_PROB);

Asynchronous Message Passing

Obviously not all computation can be written this way. But when it can, it can be extremely fast.

Factorized Updates

PageRank in GraphLab

struct pagerank : public iupdate_functor<graph, pagerank> {
  void operator()(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double sum = 0;
    foreach (edge_type edge, context.in_edges())
      sum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
    double old_rank = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * sum;
    double residual = abs(vdata.rank - old_rank) /
                      context.num_out_edges();
    if (residual > EPSILON)
      context.reschedule_out_neighbors(pagerank());
  }
};


The update above decomposes into three phases: a parallel "sum" gather, an atomic single-vertex apply, and a parallel scatter (the reschedule).

Decomposable Update Functors

Decompose update functions into 3 phases:

Gather (user-defined): a parallel "sum" over the scope of the vertex Y; partial accumulators are combined, Δ = Δ1 + Δ2 + …

Apply (user-defined): apply the accumulated value Δ to the center vertex Y.

Scatter (user-defined): update adjacent edges and vertices (and reschedule as needed).

Factorized PageRank

struct pagerank : public iupdate_functor<graph, pagerank> {
  double accum = 0, residual = 0;
  void gather(icontext_type& context, const edge_type& edge) {
    accum += context.const_edge_data(edge).weight *
             context.const_vertex_data(edge.source()).rank;
  }
  void merge(const pagerank& other) { accum += other.accum; }
  void apply(icontext_type& context) {
    vertex_data& vdata = context.vertex_data();
    double old_value = vdata.rank;
    vdata.rank = RESET_PROB + (1 - RESET_PROB) * accum;
    residual = fabs(vdata.rank - old_value) /
               context.num_out_edges();
  }
  void scatter(icontext_type& context, const edge_type& edge) {
    if (residual > EPSILON)
      context.schedule(edge.target(), pagerank());
  }
};

Demo of *everything*

PageRank

Ongoing Work

Extensions to improve performance on large graphs (see the GraphLab talk later!!):
• Better distributed graph representation methods
• Possibly better graph partitioning
• Off-core graph storage
• Continually changing graphs

An all-new rewrite of distributed GraphLab (come back in May!)
