Gl conference2014 scalabledata_yucheng

39
The Technology Behind Yucheng Low, PhD Chief Architect GraphLab Create

description

GraphLab Co-Founder and CTO presents on the scalable data structures in GraphLab Create

Transcript of Gl conference2014 scalabledata_yucheng

Page 1: Gl conference2014 scalabledata_yucheng

The Technology

Behind

Yucheng Low, PhD Chief Architect

GraphLab Create  

Page 2: Gl conference2014 scalabledata_yucheng

GraphLab Philosophy Users-First Architecture

Page 3: Gl conference2014 scalabledata_yucheng

Architecture

Systems

User

Page 4: Gl conference2014 scalabledata_yucheng

Architecture

Systems

User

Systems-First Architectures

Systems define constraints. Optimize for performance.

PowerGraph  

Page 5: Gl conference2014 scalabledata_yucheng

Architecture

Systems

User

Users-First Architectures

Users define constraints. Optimize for user interaction.

Page 6: Gl conference2014 scalabledata_yucheng

What is a Users-First Architecture for Data Science?

Page 7: Gl conference2014 scalabledata_yucheng

SFrame and SGraph

Building on decades of database and systems research.

Built by data scientists, for data scientists.

Page 8: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

Page 9: Gl conference2014 scalabledata_yucheng

Enabling users to easily and efficiently translate between both representations to get the best of both worlds.

Page 10: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

Page 11: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

Page 12: Gl conference2014 scalabledata_yucheng

SFrame Design

Jobs fail because: •  Machine run out of memory •  Did not set Java Heap Size correctly •  Resource Configuration X needs to be

bigger.

Pain Point #1: Resource Limits

Page 13: Gl conference2014 scalabledata_yucheng

SFrame Design •  Graceful Degradation as 1st principle

•  Always Works

Pain Point #2: Too Strict or Too Weak Schemas We want strong schema types. We also want weak schema types. Missing Values

Page 14: Gl conference2014 scalabledata_yucheng

SFrame Design •  Graceful Degradation as 1st principle

•  Always Works

•  Rich Datatypes •  Strong schema types: int, double, string...

•  Weak schema types: list, dictionary

Page 15: Gl conference2014 scalabledata_yucheng

SFrame Design •  Graceful Degradation as 1st principle

•  Always Works

•  Rich Datatypes •  Strong schema types: int, double, string...

•  Weak schema types: list, dictionary

Pain Point #3: Feature Manipulation Difficult or costly to inspect existing features and create new features. Hard to perform data exploration.

Page 16: Gl conference2014 scalabledata_yucheng

SFrame Design •  Graceful Degradation as 1st principle

•  Always Works

•  Rich Datatypes •  Strong schema types: int, double, string...

•  Weak schema types: list, dictionary

•  Columnar Architecture •  Easy feature engineering + Vectorized feature operations.

•  Immutable columns + Lazy evaluation •  Statistics + visualization + sketches

Page 17: Gl conference2014 scalabledata_yucheng

SFrame Python API Example Make  a  little  SFrame  of  1  column  and  5  values:  sf  =  gl.SFrame({‘x’:[1,2,3,4,5]})    Normalizes  the  column  x:  sf[‘x’]  =  sf[‘x’]  /  sf[‘x’].sum()    Uses  a  python  lambda  to  create  a  new  column:  sf[‘x-­‐squared’]  =  sf[‘x’].apply(lambda  x:  x*x)    Create  a  new  column  using  a  vectorized  operator:  sf[‘x-­‐cubed’]  =  sf[‘x-­‐squared’]  *  sf[‘x’]    Create  a  new  SFrame  taking  only  2  of  the  columns:  sf2  =  sf[[‘x’,’x-­‐squared’]]    

Page 18: Gl conference2014 scalabledata_yucheng

SFrame Querying Supports most typical SQL SELECT operations using a Pythonic syntax.

SQL SELECT  Book.title  AS  title,  COUNT(*)  AS  authors    

FROM  Book    JOIN  Book_author  ON  Book.isbn  =  Book_author.isbn    GROUP  BY  Book.title;  

SFrame Python  

Book.join(Book_author,  on=‘isbn’)          .groupby(‘title’,  {‘authors’:gl.aggregate.COUNT})  

 

Page 19: Gl conference2014 scalabledata_yucheng

SFrame Columnar Encoding

user   movie   ra+ng  

Ne#lix  Dataset,    99M  rows,  3  columns,  ints  1.4GB  raw  289MB  gzip  compressed  

Page 20: Gl conference2014 scalabledata_yucheng

SFrame Columnar Encoding

user   movie   ra+ng   Type  aware  compression:  •  Variable  Bit  length  Encode  •  Frame  Of  Reference  Encode  •  ZigZag  Encode  •  Delta  /  Delta  ZigZag  Encode  •  Dic+onary  Encode  •  General  Purpose  LZ4  

Ne#lix  Dataset,    99M  rows,  3  columns,  ints  1.4GB  raw  289MB  gzip  compressed  

SFrame  File  

Page 21: Gl conference2014 scalabledata_yucheng

SFrame Columnar Encoding

user   movie   ra+ng   Type  aware  compression:  •  Variable  Bit  length  Encode  •  Frame  Of  Reference  Encode  •  ZigZag  Encode  •  Delta  /  Delta  ZigZag  Encode  •  Dic+onary  Encode  •  General  Purpose  LZ4  

Ne#lix  Dataset,    99M  rows,  3  columns,  ints  1.4GB  raw  289MB  gzip  compressed  

User                à        176  MB   14.2  bits/int  

SFrame  File  

0.02  bits/int  Movie                à        257  KB  3.8  bits/int  Ra+ng      à        47  MB  

-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐  Total                      à          223MB  

Page 22: Gl conference2014 scalabledata_yucheng

SFrame Columnar Encoding

user   movie   ra+ng   Type  aware  compression:  •  Variable  Bit  length  Encode  •  Frame  Of  Reference  Encode  •  ZigZag  Encode  •  Delta  /  Delta  ZigZag  Encode  •  Dic+onary  Encode  •  General  Purpose  LZ4  

Ne#lix  Dataset,    99M  rows,  3  columns,  ints  1.4GB  raw  289MB  gzip  compressed  

User                à        176  MB   14.2  bits/int  

SFrame  File  

0.02  bits/int  Movie                à        257  KB  3.8  bits/int  Ra+ng      à        47  MB  

-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐  Total                      à          223MB  

10s  

Page 23: Gl conference2014 scalabledata_yucheng

SFrames Distributed

•  Distributed Dataflow •  Columnar Query Optimizations •  Communicate columnar compressed blocks

rather than row tuples.

The  choice  of  distributed  or  local  execuFon  is  a  quesFon  of  query  opFmizaFon.  

Page 24: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

Page 25: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

Page 26: Gl conference2014 scalabledata_yucheng

SGraph

•  SFrame backed graph representation. Inherits SFrame properties. •  Data types, External Memory, Columnar,

compression, etc.

•  Layout optimized for batch external memory computation.

Page 27: Gl conference2014 scalabledata_yucheng

SGraph Layout

1  

2  

3  

4  

Vertex  SFrames  

VerFces  ParFFoned  into  p  =  4  SFrames.  

Page 28: Gl conference2014 scalabledata_yucheng

SGraph Layout

1  

2  

3  

4  

Vertex  SFrames  

__id   Name   Address   ZipCode  1011   John   …   98105  2131   Jack   …   98102  

VerFces  ParFFoned  into  p  =  4  SFrames.  

Page 29: Gl conference2014 scalabledata_yucheng

SGraph Layout

1  

2  

3  

4  

Vertex  SFrames  

(1,2)  

(2,2)  

(3,2)  

(4,2)  

(1,1)  

(2,1)  

(3,1)  

(4,1)  

(1,4)  

(2,4)  

(3,4)  

(4,4)  

(1,3)  

(2,3)  

(3,3)  

(4,3)  

Edge  SFrames  

Edges  parFFoned  into  p^2  =  16  SFrames.  

Page 30: Gl conference2014 scalabledata_yucheng

SGraph Layout

1  

2  

3  

4  

Vertex  SFrames  

(1,2)  

(2,2)  

(3,2)  

(4,2)  

(1,1)  

(2,1)  

(3,1)  

(4,1)  

(1,4)  

(2,4)  

(3,4)  

(4,4)  

(1,3)  

(2,3)  

(3,3)  

(4,3)  

Edge  SFrames  

Edges  parFFoned  into  p^2  =  16  SFrames.  

Page 31: Gl conference2014 scalabledata_yucheng

SGraph Layout

1  

2  

3  

4  

Vertex  SFrames  

(1,2)  

(2,2)  

(3,2)  

(4,2)  

(1,1)  

(2,1)  

(3,1)  

(4,1)  

(1,4)  

(2,4)  

(3,4)  

(4,4)  

(1,3)  

(2,3)  

(3,3)  

(4,3)  

Edge  SFrames  

Edges  parFFoned  into  p^2  =  16  SFrames.  

Page 32: Gl conference2014 scalabledata_yucheng

SGraph Layout Vertex  SFrames  

Edge  SFrames  

Page 33: Gl conference2014 scalabledata_yucheng

SGraph Layout Vertex  SFrames  

Edge  SFrames  

Page 34: Gl conference2014 scalabledata_yucheng

SGraph Layout Vertex  SFrames  

Edge  SFrames  

Page 35: Gl conference2014 scalabledata_yucheng

SGraph Layout Vertex  SFrames  

Edge  SFrames  

Page 36: Gl conference2014 scalabledata_yucheng

SGraph Layout Vertex  SFrames  

Edge  SFrames  

Page 37: Gl conference2014 scalabledata_yucheng

Deep Integration of SFrames and SGraphs •  Seamless interaction between graph data

and table data. •  Queries can be performed easily across

graph and tables.

Page 38: Gl conference2014 scalabledata_yucheng

Demo

Page 39: Gl conference2014 scalabledata_yucheng

SFrame: Scalable Tabular Data Manipulation SGraph: Scalable Graph Manipulation

User Com.

Title Body

User Disc.

User-first architecture. Built by data scientists,

for data scientists.