Exploiting GPUs for Columnar DataFrames by Kiran Lonikar
Spark Summit (Category: Data & Analytics)
Exploiting GPUs for Columnar DataFrames
Kiran Lonikar
About Myself: Kiran Lonikar
● Presently working as Staff Engineer with Informatica, Bangalore
○ Keep track of technology trends
○ Work on futuristic products/features
● Passionate about new technologies, gadgets and healthy food
● Education:
○ Indian Institute of Technology, Bombay (1992)
○ Indian Institute of Science, Bangalore (1994)
About Informatica
• Put the potential of data to work. Informatica helps you make data ready for use in any way possible, so you can put truly great data at the center of everything you do.
• The #1 independent leader in Data Integration
• Focus on Big Data, Master Data Management, Cloud Integration and Data Security
• Founded: 1993
• Revenue 2014: $1.048 billion
• Employees: ~3,700
• Partners: 500+
– Major SI, ISV, OEM and On-Demand leaders
• Customers: over 5,800
– Customers in 82 countries
• Annual total revenue 2005-2014: CAGR = 16%
* A reconciliation of GAAP and non-GAAP results is provided in the Appendix section, as well as on Informatica's Investor Relations website
Agenda
● Introducing GPUs
● Existing applications in Big Data
● CPU the new bottleneck
● Project Tungsten
● Proposal: Extending Tungsten
○ GPU for parallel execution across rows
○ Code generation changes (minor refactoring)
○ Batched execution, columnar layout (major refactoring to DataFrame)
● Results, demo
● Future work, competing products
GPUs are Omnipresent
● Jetson TK1: 192-core GPU, 5"x5" board, 20 W
● GPU servers: up to 5,760 cores
● Amazon EC2 g2 instance: $0.65/hour, 1,536 cores
● Nexus 9: 192 cores
Hardware Architecture: Latency vs Throughput
[Diagram: a thread block split into warps; Warp 1 (threads t1, t2, ..., t32) and Warp 2 each execute instructions ins 1..ins 4 in lockstep over time]
SIMT: Single Instruction Multiple Thread
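The SIMT model on the slide can be mimicked on the host: every thread (lane) in a warp executes the same instruction on its own data element, and a branch runs both sides with inactive lanes masked out. A minimal plain-C sketch (warp size 32 assumed; the computed expression is illustrative, not from the talk):

```c
#include <assert.h>

#define WARP_SIZE 32

/* Emulate one warp executing "out = in * 2; if (out >= 32) out -= 32;"
   in lockstep: each "instruction" is applied to all 32 lanes before
   the next starts, and the branch is handled by a per-lane mask. */
static void warp_step(const int *in, int *out) {
    int active[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 1: multiply, all lanes */
        out[t] = in[t] * 2;
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 2: evaluate the predicate */
        active[t] = (out[t] >= 32);
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 3: subtract, masked lanes only */
        if (active[t])
            out[t] -= 32;
}
```

Divergent branches therefore cost both paths for the whole warp, which is why GPU code favors uniform, data-parallel work per element.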
GPU Programming Model
[Diagram 1: discrete GPU: CPU RAM and CPU processing connected over the PCIe bus to GPU RAM, with many GPU processing units (GPU 1, GPU 2) sharing GPU RAM]
[Diagram 2: Heterogeneous System Architecture based SoC: shared CPU+GPU RAM, with CPU processing and GPU processing units side by side]
Programming options:
● CUDA C/C++ (NVidia GPUs)
● OpenCL C/C++ (all GPUs)
● JavaCL/ScalaCL, Aparapi, Rootbeer
● JDK 1.9 Lambdas
GPUs in the world of Big Data
● LHC CERN's ROOT: 30 PB per day, GPU based ML packages
● Analytic DBs: 12 GPUs, 60,000 cores on a node (gpudb, sqream, mapd)
● Deep Learning: image classification, speech recognition, NLP, genomics, DNA
● SparkCL:
○ Aparapi based APIs to develop Spark closures
○ Aparapi converts Java code to OpenCL and runs it on GPUs
Natural progression into the computing dimension
[Timeline: 2008 onwards, 2012 onwards, 2015 onwards]
Spark SQL Architecture
[Diagram: rows 1-3, each laid out as: null bit set (1 bit/col) | values (8 bytes/col) | variable length data (length, data)]
Tungsten Row: Instead of Array of Java Objects
[Diagram: the same rows stored column by column; each of column 1..column 4 holds a null/typeId header followed by the row 1, row 2, row 3 values, with nulls tracked per entry]
Columnar Cache
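As an illustration of the Tungsten-style flat row described above (a sketch, not Spark's actual code), the following C fragment packs a two-column row into one byte buffer: a leading 8-byte word holds the null bits, followed by one 8-byte value slot per column.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat row mirroring "null bit set (1 bit/col),
   values (8 bytes/col)"; variable-length data is omitted here. */
enum { NUM_COLS = 2, ROW_BYTES = 8 + 8 * NUM_COLS };

typedef struct { uint8_t buf[ROW_BYTES]; } FlatRow;

static void row_init(FlatRow *r) { memset(r->buf, 0, ROW_BYTES); }

static void row_set_double(FlatRow *r, int col, double v) {
    memcpy(r->buf + 8 + 8 * col, &v, sizeof v); /* write the 8-byte slot */
}

static void row_set_null(FlatRow *r, int col) {
    r->buf[0] |= (uint8_t)(1u << col);          /* set the column's null bit */
}

static int row_is_null(const FlatRow *r, int col) {
    return (r->buf[0] >> col) & 1;
}

static double row_get_double(const FlatRow *r, int col) {
    double v;
    memcpy(&v, r->buf + 8 + 8 * col, sizeof v);
    return v;
}
```

Because the whole row is one contiguous byte array, it can be allocated off-heap and handed around without creating per-field Java objects, which is the point of the Tungsten layout.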
[Images: 10 Gbps Ethernet, Infiniband; SSDs, striped HDD arrays]
• Higher IO throughput: from the Project Tungsten blog and Reynold Xin's talk, slide 21
– Hardware advances in the last 5 years: 10x improvements
– Software advances:
• Spark optimizer: prune input data to avoid unnecessary disk IO
• Improved file formats: binary, compressed, columnar (Parquet, ORC)
• Less memory pressure:
– Hardware: high memory bandwidths
– Software: taking over memory allocation
⇒ More data available to process. CPU the new bottleneck.
CPU the new bottleneck
Project Tungsten
• Taking over memory management and bypassing GC
– Avoid large Java object overhead and GC overhead
– Replace Java object allocation with sun.misc.Unsafe based explicit allocation and freeing
– Replace general purpose data structures like java.util.HashMap with an explicit binary map
• Cache aware computation
– Change internal data structures to make them cache friendly
• Co-locate key and value reference in one record for sorting
• Code generation
– Expressions of columns for selecting and filtering executed through generated Java code ⇒ avoids expensive expression tree evaluation for each row
Proposal: Execution on GPUs
Goal: Change execution within a partition from serial row-by-row to batched/vectorized parallel execution
• Change code generation to generate OpenCL code
• Change executor code (Project, TungstenProject in basicOperators.scala) to execute OpenCL code through JavaCL
• Columnar layout of input data for GPU execution
– BatchRow/CacheBatch: references to required columnar arrays instead of creating and processing InternalRow objects
– UnsafeColumn/ByteBuffer: columnar structure to be used for GPU execution
[Diagram: a table with columns A, B, C and rows (a0, b0, c0), (a1, b1, c1), (a2, b2, c2)]
Row wise layout: a0 b0 c0 | a1 b1 c1 | a2 b2 c2
Columnar layout: a0 a1 a2 | b0 b1 b2 | c0 c1 c2
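The two layouts above hold the same nine values in different orders, so converting the row-wise buffer into columnar form is just a transpose. A minimal C sketch:

```c
#include <assert.h>

/* Scatter a row-wise buffer (a0 b0 c0 a1 b1 c1 ...) into a columnar
   buffer (a0 a1 a2 ... b0 b1 b2 ... c0 c1 c2 ...). */
static void to_columnar(const float *rows, int nrows, int ncols, float *cols) {
    for (int r = 0; r < nrows; r++)
        for (int c = 0; c < ncols; c++)
            cols[c * nrows + r] = rows[r * ncols + c];
}
```

This strided scatter is exactly the per-row cost that building a columnar cache pays once, so that every later pass over a column is a sequential scan.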
Code Generation Changes

// Existing generated Java code
class SpecificUnsafeProjection extends UnsafeProjection {
  private UnsafeRow row = new UnsafeRow();
  // buffer for 2 cols, null bits
  private byte[] buffer11 = new byte[24];
  private int cursor12 = 24; // size of buffer for 2 cols
  // initialization code, constructor etc.
  public UnsafeRow apply(InternalRow i) {
    double primitive3 = -1.0;
    int fixedOffset = Platform.BYTE_ARRAY_OFFSET;
    row.pointTo(buffer11, fixedOffset, 2, cursor12);
    if (!i.isNullAt(0) && !i.isNullAt(1)) {
      primitive3 = 2*i.getInt(0) + 4*i.getDouble(1);
      row.setDouble(1, primitive3);
    } else {
      row.setNull(1);
    }
    return row;
  }
}
// New OpenCL sample code: Columnar
__kernel void computeExpression(__global const int *a, __global const char *aNulls,
                                __global const int *b, __global const char *bNulls,
                                __global int *output, __global char *outNulls,
                                int dataSize) {
  int i = get_global_id(0);
  if (i < dataSize) {
    if (aNulls[i] == 0 && bNulls[i] == 0) {
      output[i] = 2*a[i] + 4*b[i];
      outNulls[i] = 0;
    } else {
      outNulls[i] = 1;
    }
  }
}

// Scala code to drive the OpenCL code:
1. rowIterator ⇒ ByteBuffers with a, b, aNulls, bNulls: 20x the time of steps 2, 3, 4
2. Transfer ByteBuffers
3. Execute computeExpression
4. Read output, outNulls into ByteBuffers ⇒ Cache
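The kernel body is easy to check on the host: the same per-element logic can run as a plain C loop over the columnar buffers, with one loop iteration playing the role of one GPU work item (a sketch; one null-flag byte per value assumed, as in the kernel):

```c
#include <assert.h>

/* Host-side equivalent of the computeExpression kernel above. */
static void compute_expression(const int *a, const char *aNulls,
                               const int *b, const char *bNulls,
                               int *output, char *outNulls, int dataSize) {
    for (int i = 0; i < dataSize; i++) {
        if (aNulls[i] == 0 && bNulls[i] == 0) {
            output[i] = 2 * a[i] + 4 * b[i]; /* same expression as the kernel */
            outNulls[i] = 0;
        } else {
            outNulls[i] = 1;                 /* propagate null */
        }
    }
}
```

Note the cost breakdown in the driver steps: gathering rows into ByteBuffers (step 1) dominates, which is the motivation for keeping the data columnar in the first place.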
Row wise execution
[Diagram: input data (columns A, B, C; rows a0 b0 c0 / a1 b1 c1 / a2 b2 c2) stored row wise in CPU RAM, transferred row wise into GPU RAM, then loaded row wise into the Streaming Multiprocessor (SMP) cache, where threads t1, t2, t3 each process rows r0, r1, r2]

// Only A and C needed to compute D
// B not needed, but transferred anyway
typedef struct { float a; float b; float c; } row;
__kernel void expr(__global row *r, __global float *d, int n) {
  int id = get_global_id(0);
  if (id < n)
    d[id] = 3*r[id].a + 2*r[id].c;
}
Columnar execution
[Diagram: input data (columns A, B, C) stored columnar in CPU RAM; only columns A and C (a0 a1 a2, c0 c1 c2) are transferred to GPU RAM and loaded into the Streaming Multiprocessor (SMP) cache, where threads t1, t2, t3 process elements a0/c0, a1/c1, a2/c2]

// Only A and C needed to compute D
// Only A and C transferred
__kernel void expr(__global float *a, __global float *c, __global float *d, int n) {
  int id = get_global_id(0);
  if (id < n)
    d[id] = 3*a[id] + 2*c[id];
}
JVM Considerations
• Row wise representation: array of Java objects
– Java objects are not the same as C structs: members not contiguous
– Serialization needed before transfer to GPU RAM
• Columnar representation: arrays of individual members
– Already serialized
– Saves Host-GPU and GPU RAM-SMP cache data transfers
– Avoids copying from input rows to projected InternalRow objects
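The "already serialized" point can be made concrete: a columnar primitive array is one contiguous block, so handing it to the GPU is a single bulk copy, whereas Java-style row objects live behind separate references and must be gathered field by field first. A C sketch (rows modeled as separately addressed structs, an analogy for Java objects rather than their actual heap layout):

```c
#include <assert.h>
#include <string.h>

typedef struct { float a, b, c; } Row;

/* Row-wise: rows sit behind separate pointers, so column A must be
   gathered element by element before any transfer. */
static void gather_column_a(Row *const *rows, int n, float *out) {
    for (int i = 0; i < n; i++)
        out[i] = rows[i]->a;
}

/* Columnar: the column is already one contiguous buffer, so a single
   memcpy stands in for the host-to-GPU transfer. */
static void transfer(const float *col, int n, float *dst) {
    memcpy(dst, col, n * sizeof *col);
}
```

In the row-wise case the gather itself is the serialization step the slide warns about; in the columnar case it disappears entirely.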
DataFrame Execution: Current

val data = sc.parallelize(1 to size, 5).map { x => (x, x*x) }.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: trigger execution

[Diagram: rows processed one at a time into the columnar cache (buildBuffers):
(1, 1*1) ⇒ (1, 1, 1*2+1*4)
(2, 2*2) ⇒ (2, 4, 2*2+4*4)
(3, 3*3) ⇒ (3, 9, 2*3+4*9)
Columnar Cache ⇒ Rows: (1, 1, 6), (2, 4, 20), (3, 9, 42)]
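The numbers in the diagram can be reproduced directly. A C sketch of a hypothetical buildBuffers (not Spark's actual code) that consumes the (x, x*x) rows one at a time, filling the three column buffers and evaluating the same select expression key*2 + value*4:

```c
#include <assert.h>

/* Hypothetical buildBuffers: append each (key, value) row to three
   column buffers, evaluating key*2 + value*4 per row. */
static void build_buffers(int n, int *keys, int *values, int *exprs) {
    for (int x = 1; x <= n; x++) {       /* rows (x, x*x), as in the Scala code */
        keys[x - 1]   = x;
        values[x - 1] = x * x;
        exprs[x - 1]  = 2 * x + 4 * (x * x);
    }
}
```

Running it for n = 3 fills the expression column with 6, 20, 42, matching the cache contents on the slide; the proposed change below moves exactly this per-row expression loop onto the GPU as one batch.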
DataFrame Execution: Proposed

val data = sc.parallelize(1 to size, 5).map { x => (x, x*x) }.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: trigger execution

[Diagram: the columnar cache (buildBuffers) is built directly from the key column (1, 2, 3) and value column (1, 4, 9); both columns are handed to the GPU, which computes the expression column (6, 20, 42) in one batch. Columnar Cache ⇒ Rows: (1, 1, 6), (2, 4, 20), (3, 9, 42)]
Proposal: Batched Execution
[Pipeline diagram:]
Input: Parquet, ORC, relational DBs ⇒ In-memory RDDs ⇒ DFs (byte code modification through Javassist to build BatchRow+UnsafeColumn) ⇒ Columnar Cache ⇒ Pipelined operations: Filter, Join, Union, Sort, Group by, ... ⇒ In-memory RDDs ⇒ DFs (byte code modification through Javassist to consume BatchRow+UnsafeColumn) ⇒ Output: Parquet, ORC, relational DBs
Results
[Chart: execution time comparison, GPU vs CPU]
Roadmap for future changes
• Spark
– Multi-GPU
– Sorting: GPU based TimSort
– Aggregations (groupBy)
– Union
– Join
• Other projects capable of competing with Spark
– Impala (C++, easier to adapt than Scala/JVM for GPU)
– CERN ROOT (C++ REPL, multi-node)
– Flink
– Thrust (CUDA C++, single node, single GPU)
– Boost Compute (OpenCL, C++, single node, single GPU)
– VexCL (C++, OpenCL, CUDA, multi-GPU, multi-node)
Q&A
Contact Info
○ Twitter: @KiranLonikar
○ https://www.linkedin.com/in/kiranlonikar
○ [email protected]