Exploiting GPUs for Columnar DataFrames by Kiran Lonikar
Spark Summit (Category: Data & Analytics)
Exploiting GPUs for Columnar DataFrames
Kiran Lonikar
About Myself: Kiran Lonikar
● Presently working as Staff Engineer with Informatica, Bangalore
○ Keep track of technology trends
○ Work on futuristic products/features
● Passionate about new technologies, gadgets and healthy food
● Education:
○ Indian Institute of Technology, Bombay (1992)
○ Indian Institute of Science, Bangalore (1994)
About Informatica
• Put the potential of data to work. Informatica helps you make data ready for use in any way possible, so you can put truly great data at the center of everything you do.
• The #1 independent leader in Data Integration
• Focus on Big Data, Master Data Management, Cloud Integration and Data Security
• Founded: 1993
• Revenue 2014: $1.048 billion
• Employees: ~3,700
• Partners: 500+
– Major SI, ISV, OEM and On-Demand leaders
• Customers: over 5,800
– Customers in 82 countries
• Annual total revenue 2005-2014: CAGR = 16%
* A reconciliation of GAAP and non-GAAP results is provided in the Appendix section, as well as on Informatica's Investor Relations website
Agenda
● Introducing GPUs
● Existing applications in Big Data
● CPU the new bottleneck
● Project Tungsten
● Proposal: Extending Tungsten
○ GPU for parallel execution across rows
○ Code generation changes (minor refactoring)
○ Batched execution, columnar layout (major refactoring to DataFrame)
● Results, demo
● Future work, competing products
GPUs are Omnipresent
● Jetson TK1: 192-core GPU, 5"x5" board, 20 W
● GPU servers: up to 5,760 cores
● Amazon EC2 g2 instance: $0.65/hour, 1,536 cores
● Nexus 9: 192 cores
Hardware Architecture: Latency vs Throughput
[Diagram: a thread block split into warps; Warp 1 (threads t1, t2, ..., t32) and Warp 2 each execute instructions ins 1..ins 4 in lockstep over time]
SIMT: Single Instruction Multiple Thread
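The SIMT model on the slide can be mimicked on the host: every thread (lane) in a warp executes the same instruction on its own data element, and a branch runs both sides with inactive lanes masked out. A minimal plain-C sketch (warp size 32 assumed; the computed expression is illustrative, not from the talk):

```c
#include <assert.h>

#define WARP_SIZE 32

/* Emulate one warp executing "out = in * 2; if (out >= 32) out -= 32;"
   in lockstep: each "instruction" is applied to all 32 lanes before
   the next starts, and the branch is handled by a per-lane mask. */
static void warp_step(const int *in, int *out) {
    int active[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 1: multiply, all lanes */
        out[t] = in[t] * 2;
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 2: evaluate the predicate */
        active[t] = (out[t] >= 32);
    for (int t = 0; t < WARP_SIZE; t++)   /* ins 3: subtract, masked lanes only */
        if (active[t])
            out[t] -= 32;
}
```

Divergent branches therefore cost both paths for the whole warp, which is why GPU code favors uniform, data-parallel work per element.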
GPU Programming Model
[Diagram 1: discrete GPU: CPU RAM and CPU processing connected over the PCIe bus to GPU RAM, with many GPU processing units (GPU 1, GPU 2) sharing GPU RAM]
[Diagram 2: Heterogeneous System Architecture based SoC: shared CPU+GPU RAM, with CPU processing and GPU processing units side by side]
Programming options:
● CUDA C/C++ (NVidia GPUs)
● OpenCL C/C++ (all GPUs)
● JavaCL/ScalaCL, Aparapi, Rootbeer
● JDK 1.9 Lambdas
GPUs in the world of Big Data
● LHC CERN's ROOT: 30 PB per day, GPU based ML packages
● Analytic DBs: 12 GPUs, 60,000 cores on a node (gpudb, sqream, mapd)
● Deep Learning: image classification, speech recognition, NLP, genomics, DNA
● SparkCL:
○ Aparapi based APIs to develop Spark closures
○ Aparapi converts Java code to OpenCL and runs it on GPUs
Natural progression into the computing dimension
[Timeline: 2008 onwards, 2012 onwards, 2015 onwards]
Spark SQL Architecture
[Diagram: rows 1-3, each laid out as: null bit set (1 bit/col) | values (8 bytes/col) | variable length data (length, data)]
Tungsten Row: Instead of Array of Java Objects
[Diagram: the same rows stored column by column; each of column 1..column 4 holds a null/typeId header followed by the row 1, row 2, row 3 values, with nulls tracked per entry]
Columnar Cache
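As an illustration of the Tungsten-style flat row described above (a sketch, not Spark's actual code), the following C fragment packs a two-column row into one byte buffer: a leading 8-byte word holds the null bits, followed by one 8-byte value slot per column.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flat row mirroring "null bit set (1 bit/col),
   values (8 bytes/col)"; variable-length data is omitted here. */
enum { NUM_COLS = 2, ROW_BYTES = 8 + 8 * NUM_COLS };

typedef struct { uint8_t buf[ROW_BYTES]; } FlatRow;

static void row_init(FlatRow *r) { memset(r->buf, 0, ROW_BYTES); }

static void row_set_double(FlatRow *r, int col, double v) {
    memcpy(r->buf + 8 + 8 * col, &v, sizeof v); /* write the 8-byte slot */
}

static void row_set_null(FlatRow *r, int col) {
    r->buf[0] |= (uint8_t)(1u << col);          /* set the column's null bit */
}

static int row_is_null(const FlatRow *r, int col) {
    return (r->buf[0] >> col) & 1;
}

static double row_get_double(const FlatRow *r, int col) {
    double v;
    memcpy(&v, r->buf + 8 + 8 * col, sizeof v);
    return v;
}
```

Because the whole row is one contiguous byte array, it can be allocated off-heap and handed around without creating per-field Java objects, which is the point of the Tungsten layout.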
[Images: 10 Gbps Ethernet, Infiniband; SSDs, striped HDD arrays]
• Higher IO throughput: from the Project Tungsten blog and Reynold Xin's talk, slide 21
– Hardware advances in the last 5 years: 10x improvements
– Software advances:
• Spark optimizer: prune input data to avoid unnecessary disk IO
• Improved file formats: binary, compressed, columnar (Parquet, ORC)
• Less memory pressure:
– Hardware: high memory bandwidths
– Software: taking over memory allocation
⇒ More data available to process. CPU the new bottleneck.
CPU the new bottleneck
Project Tungsten
• Taking over memory management and bypassing GC
– Avoid large Java object overhead and GC overhead
– Replace Java object allocation with sun.misc.Unsafe based explicit allocation and freeing
– Replace general purpose data structures like java.util.HashMap with an explicit binary map
• Cache aware computation
– Change internal data structures to make them cache friendly
• Co-locate key and value reference in one record for sorting
• Code generation
– Expressions of columns for selecting and filtering executed through generated Java code ⇒ avoids expensive expression tree evaluation for each row
Proposal: Execution on GPUs
Goal: Change execution within a partition from serial row-by-row to batched/vectorized parallel execution
• Change code generation to generate OpenCL code
• Change executor code (Project, TungstenProject in basicOperators.scala) to execute OpenCL code through JavaCL
• Columnar layout of input data for GPU execution
– BatchRow/CacheBatch: references to required columnar arrays instead of creating and processing InternalRow objects
– UnsafeColumn/ByteBuffer: columnar structure to be used for GPU execution
[Diagram: a table with columns A, B, C and rows (a0, b0, c0), (a1, b1, c1), (a2, b2, c2)]
Row wise layout: a0 b0 c0 | a1 b1 c1 | a2 b2 c2
Columnar layout: a0 a1 a2 | b0 b1 b2 | c0 c1 c2
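The two layouts above hold the same nine values in different orders, so converting the row-wise buffer into columnar form is just a transpose. A minimal C sketch:

```c
#include <assert.h>

/* Scatter a row-wise buffer (a0 b0 c0 a1 b1 c1 ...) into a columnar
   buffer (a0 a1 a2 ... b0 b1 b2 ... c0 c1 c2 ...). */
static void to_columnar(const float *rows, int nrows, int ncols, float *cols) {
    for (int r = 0; r < nrows; r++)
        for (int c = 0; c < ncols; c++)
            cols[c * nrows + r] = rows[r * ncols + c];
}
```

This strided scatter is exactly the per-row cost that building a columnar cache pays once, so that every later pass over a column is a sequential scan.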
Code Generation Changes

// Existing generated Java code
class SpecificUnsafeProjection extends UnsafeProjection {
  private UnsafeRow row = new UnsafeRow();
  // buffer for 2 cols, null bits
  private byte[] buffer11 = new byte[24];
  private int cursor12 = 24; // size of buffer for 2 cols
  // initialization code, constructor etc.
  public UnsafeRow apply(InternalRow i) {
    double primitive3 = -1.0;
    int fixedOffset = Platform.BYTE_ARRAY_OFFSET;
    row.pointTo(buffer11, fixedOffset, 2, cursor12);
    if (!i.isNullAt(0) && !i.isNullAt(1)) {
      primitive3 = 2*i.getInt(0) + 4*i.getDouble(1);
      row.setDouble(1, primitive3);
    } else {
      row.setNull(1);
    }
    return row;
  }
}
// New OpenCL sample code: Columnar
__kernel void computeExpression(__global const int *a, __global const char *aNulls,
                                __global const int *b, __global const char *bNulls,
                                __global int *output, __global char *outNulls,
                                int dataSize) {
  int i = get_global_id(0);
  if (i < dataSize) {
    if (aNulls[i] == 0 && bNulls[i] == 0) {
      output[i] = 2*a[i] + 4*b[i];
      outNulls[i] = 0;
    } else {
      outNulls[i] = 1;
    }
  }
}

// Scala code to drive the OpenCL code:
1. rowIterator ⇒ ByteBuffers with a, b, aNulls, bNulls: 20x the time of steps 2, 3, 4
2. Transfer ByteBuffers
3. Execute computeExpression
4. Read output, outNulls into ByteBuffers ⇒ Cache
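The kernel body is easy to check on the host: the same per-element logic can run as a plain C loop over the columnar buffers, with one loop iteration playing the role of one GPU work item (a sketch; one null-flag byte per value assumed, as in the kernel):

```c
#include <assert.h>

/* Host-side equivalent of the computeExpression kernel above. */
static void compute_expression(const int *a, const char *aNulls,
                               const int *b, const char *bNulls,
                               int *output, char *outNulls, int dataSize) {
    for (int i = 0; i < dataSize; i++) {
        if (aNulls[i] == 0 && bNulls[i] == 0) {
            output[i] = 2 * a[i] + 4 * b[i]; /* same expression as the kernel */
            outNulls[i] = 0;
        } else {
            outNulls[i] = 1;                 /* propagate null */
        }
    }
}
```

Note the cost breakdown in the driver steps: gathering rows into ByteBuffers (step 1) dominates, which is the motivation for keeping the data columnar in the first place.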
Row wise execution
[Diagram: input data (columns A, B, C; rows a0 b0 c0 / a1 b1 c1 / a2 b2 c2) stored row wise in CPU RAM, transferred row wise into GPU RAM, then loaded row wise into the Streaming Multiprocessor (SMP) cache, where threads t1, t2, t3 each process rows r0, r1, r2]

// Only A and C needed to compute D
// B not needed, but transferred anyway
typedef struct { float a; float b; float c; } row;
__kernel void expr(__global row *r, __global float *d, int n) {
  int id = get_global_id(0);
  if (id < n)
    d[id] = 3*r[id].a + 2*r[id].c;
}
Columnar execution
[Diagram: input data (columns A, B, C) stored columnar in CPU RAM; only columns A and C (a0 a1 a2, c0 c1 c2) are transferred to GPU RAM and loaded into the Streaming Multiprocessor (SMP) cache, where threads t1, t2, t3 process elements a0/c0, a1/c1, a2/c2]

// Only A and C needed to compute D
// Only A and C transferred
__kernel void expr(__global float *a, __global float *c, __global float *d, int n) {
  int id = get_global_id(0);
  if (id < n)
    d[id] = 3*a[id] + 2*c[id];
}
JVM Considerations
• Row wise representation: array of Java objects
– Java objects are not the same as C structs: members not contiguous
– Serialization needed before transfer to GPU RAM
• Columnar representation: arrays of individual members
– Already serialized
– Saves Host-GPU and GPU RAM-SMP cache data transfers
– Avoids copying from input rows to projected InternalRow objects
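The "already serialized" point can be made concrete: a columnar primitive array is one contiguous block, so handing it to the GPU is a single bulk copy, whereas Java-style row objects live behind separate references and must be gathered field by field first. A C sketch (rows modeled as separately addressed structs, an analogy for Java objects rather than their actual heap layout):

```c
#include <assert.h>
#include <string.h>

typedef struct { float a, b, c; } Row;

/* Row-wise: rows sit behind separate pointers, so column A must be
   gathered element by element before any transfer. */
static void gather_column_a(Row *const *rows, int n, float *out) {
    for (int i = 0; i < n; i++)
        out[i] = rows[i]->a;
}

/* Columnar: the column is already one contiguous buffer, so a single
   memcpy stands in for the host-to-GPU transfer. */
static void transfer(const float *col, int n, float *dst) {
    memcpy(dst, col, n * sizeof *col);
}
```

In the row-wise case the gather itself is the serialization step the slide warns about; in the columnar case it disappears entirely.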
DataFrame Execution: Current

val data = sc.parallelize(1 to size, 5).map { x => (x, x*x) }.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: trigger execution

[Diagram: rows processed one at a time into the columnar cache (buildBuffers):
(1, 1*1) ⇒ (1, 1, 1*2+1*4)
(2, 2*2) ⇒ (2, 4, 2*2+4*4)
(3, 3*3) ⇒ (3, 9, 2*3+4*9)
Columnar Cache ⇒ Rows: (1, 1, 6), (2, 4, 20), (3, 9, 42)]
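The numbers in the diagram can be reproduced directly. A C sketch of a hypothetical buildBuffers (not Spark's actual code) that consumes the (x, x*x) rows one at a time, filling the three column buffers and evaluating the same select expression key*2 + value*4:

```c
#include <assert.h>

/* Hypothetical buildBuffers: append each (key, value) row to three
   column buffers, evaluating key*2 + value*4 per row. */
static void build_buffers(int n, int *keys, int *values, int *exprs) {
    for (int x = 1; x <= n; x++) {       /* rows (x, x*x), as in the Scala code */
        keys[x - 1]   = x;
        values[x - 1] = x * x;
        exprs[x - 1]  = 2 * x + 4 * (x * x);
    }
}
```

Running it for n = 3 fills the expression column with 6, 20, 42, matching the cache contents on the slide; the proposed change below moves exactly this per-row expression loop onto the GPU as one batch.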
DataFrame Execution: Proposed

val data = sc.parallelize(1 to size, 5).map { x => (x, x*x) }.toDF("key", "value")
val data1 = data.select($"key", $"value", $"key"*2 + $"value"*4).cache
data1.show() // show first 20 rows: trigger execution

[Diagram: the columnar cache (buildBuffers) is built directly from the key column (1, 2, 3) and value column (1, 4, 9); both columns are handed to the GPU, which computes the expression column (6, 20, 42) in one batch. Columnar Cache ⇒ Rows: (1, 1, 6), (2, 4, 20), (3, 9, 42)]
Proposal: Batched Execution
[Pipeline diagram:]
Input: Parquet, ORC, relational DBs ⇒ In-memory RDDs ⇒ DFs (byte code modification through Javassist to build BatchRow+UnsafeColumn) ⇒ Columnar Cache ⇒ Pipelined operations: Filter, Join, Union, Sort, Group by, ... ⇒ In-memory RDDs ⇒ DFs (byte code modification through Javassist to consume BatchRow+UnsafeColumn) ⇒ Output: Parquet, ORC, relational DBs
Results
[Chart: execution time comparison, GPU vs CPU]
Roadmap for future changes
• Spark
– Multi-GPU
– Sorting: GPU based TimSort
– Aggregations (groupBy)
– Union
– Join
• Other projects capable of competing with Spark
– Impala (C++, easier to adapt than Scala/JVM for GPU)
– CERN ROOT (C++ REPL, multi-node)
– Flink
– Thrust (CUDA C++, single node, single GPU)
– Boost Compute (OpenCL, C++, single node, single GPU)
– VexCL (C++, OpenCL, CUDA, multi-GPU, multi-node)
Q&A
Contact Info
○ Twitter: @KiranLonikar
○ https://www.linkedin.com/in/kiranlonikar
○ [email protected]