Spark + HBase
![Page 1: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/1.jpg)
Spark + HBase
Bringing HBase Data Efficiently into Spark with DataFrame Support
Zhan Zhang
Software Engineer
04/08/2016
![Page 2: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/2.jpg)
Page 2 © Hortonworks Inc. 2014
About Zhan Zhang
Zhan Zhang (Software Engineer at Hortonworks)
Currently Focuses on Apache Spark, Hadoop, etc.
Contributes to Apache Spark, YARN, HBase, Ambari, etc.
Experience in Computer Networks, Distributed Systems, and Machine Learning Platforms
![Page 3: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/3.jpg)
Why Revamp the Existing HBase Connector?
Limited Spark Support in HBase Upstream
– Scalability
– RDD Level, but Spark Is Moving to DataFrame/Dataset
– Data Loss and Data Duplication
Stability
– Correctness
– Stability Impact with Co-processor
– Serialized RDD Lineage to HBase
– Maintenance Overhead: Internal Hacks
![Page 4: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/4.jpg)
What Improvements Have We Made?
Combine Spark and HBase
– Spark Catalyst Engine for Query Plan and Optimization
– HBase for Fast Access KV Store
– Implement Standard External Data Source with Built-in Filter
High Performance
– Data Locality: Move Computation to Data
– Partition Pruning: Tasks Only Performed on Region Servers (RS) Holding Requested Data
– Column Pruning / Predicate Pushdown: Reduce Network Overhead
Full-Fledged DataFrame Support
– Spark SQL
– Language Integrated Query
Run on Top of Existing HBase Tables
– Native Support for Java Primitive Types
![Page 5: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/5.jpg)
More …
Composite Key
Avro Format
Customized Serdes
![Page 6: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/6.jpg)
Usage - Define the Catalog
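The catalog on this slide survives only as an image, so the following is a minimal sketch rather than the speaker's code. It assumes the JSON catalog layout documented for the Hortonworks Spark-HBase connector (shc), where each DataFrame column maps to an HBase column family and qualifier, and row key fields use the pseudo column family `rowkey`. The table name, column names, and types are hypothetical.

```python
import json

# Hypothetical catalog: table "table1" in the "default" namespace,
# one row key field and two value columns in separate column families.
catalog = {
    "table": {"namespace": "default", "name": "table1"},
    "rowkey": "key",
    "columns": {
        # Row key fields are assumed to use the pseudo column family "rowkey".
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        "col1": {"cf": "cf1", "col": "col1", "type": "int"},
        "col2": {"cf": "cf2", "col": "col2", "type": "double"},
    },
}

# The connector is assumed to receive the catalog as a JSON string.
catalog_json = json.dumps(catalog, indent=2)


def rowkey_fields(cat: dict) -> list:
    """Return the DataFrame column names that are mapped to the row key."""
    return [name for name, spec in cat["columns"].items() if spec["cf"] == "rowkey"]


if __name__ == "__main__":
    print(catalog_json)
    print("row key columns:", rowkey_fields(catalog))
```

Keeping the mapping in one JSON document means the same definition can serve both the write path and the read path.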
![Page 7: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/7.jpg)
Usage - Write to HBase
![Page 8: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/8.jpg)
Usage - Construct DataFrame
![Page 9: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/9.jpg)
Usage - Language Integrated Query
![Page 10: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/10.jpg)
Usage - Spark SQL
![Page 11: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/11.jpg)
Usage - With Other Data Sources
![Page 12: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/12.jpg)
![Page 13: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/13.jpg)
![Page 14: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/14.jpg)
Spark HBase Connector Architecture
![Page 15: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/15.jpg)
Byte Array Order: SHORT/INT/LONG
Byte array order: 0, 1, 2, …, MAX, MIN, …, -2, -1
WHERE X <= 2: ranges [0, 2] and [MIN, -1]
WHERE X >= -2: ranges [-2, -1] and [0, MAX]
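The ordering on this slide can be checked with a small sketch. It assumes integers are stored as big-endian two's-complement bytes and compared as unsigned byte arrays, which is why negative values sort after positive ones. The helper names `encode_int`, `scan_ranges_le`, and `scan_ranges_ge` are hypothetical and not part of any connector API.

```python
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def encode_int(x: int) -> bytes:
    """Big-endian two's-complement encoding of a 32-bit signed int."""
    return struct.pack(">i", x)


# Python compares bytes objects lexicographically as unsigned bytes,
# which mirrors byte-array ordering in the store.
values = [INT_MIN, -2, -1, 0, 1, 2, INT_MAX]
byte_order = sorted(values, key=encode_int)


def scan_ranges_le(bound: int) -> list:
    """Inclusive ranges, in byte order, that cover WHERE X <= bound."""
    if bound >= 0:
        # Non-negative part up to the bound, plus every negative value.
        return [(0, bound), (INT_MIN, -1)]
    return [(INT_MIN, bound)]


def scan_ranges_ge(bound: int) -> list:
    """Inclusive ranges, in byte order, that cover WHERE X >= bound."""
    if bound < 0:
        # Negative part from the bound up to -1, plus every non-negative value.
        return [(bound, -1), (0, INT_MAX)]
    return [(bound, INT_MAX)]


if __name__ == "__main__":
    print(byte_order)
    print(scan_ranges_le(2))
    print(scan_ranges_ge(-2))
```

A single numeric predicate therefore turns into up to two contiguous byte ranges, one on each side of the sign boundary.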
![Page 16: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/16.jpg)
Implementation
Partition Pruning:
– Split into Multiple Ranges, e.g., WHERE X < 2
Data Locality:
– Each RDD Partition Has a Preferred Location
Column Pruning:
– Required Columns in Scan/BulkGet
Predicate Pushdown:
– HBase Built-in Filters
Scan/BulkGets:
– Grouped by Region Server
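Two of the points above, partition pruning and grouping by region server, can be sketched in plain Python. This is an illustration under stated assumptions, not the connector's implementation: the `Region` type, the region boundaries, and the server names are hypothetical, and regions are modeled as half-open key ranges where an empty byte string means "unbounded".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: bytes  # inclusive; b"" means "from the first key"
    end: bytes    # exclusive; b"" means "to the last key"
    server: str   # hypothetical region server name


def overlaps(region: Region, lo: bytes, hi: bytes) -> bool:
    """True if region [start, end) intersects the scan range [lo, hi)."""
    starts_before_hi = hi == b"" or region.start < hi
    ends_after_lo = region.end == b"" or lo < region.end
    return starts_before_hi and ends_after_lo


def prune(regions: list, ranges: list) -> list:
    """Keep only the regions touched by at least one scan range."""
    return [r for r in regions if any(overlaps(r, lo, hi) for lo, hi in ranges)]


def locate(regions: list, key: bytes) -> Region:
    """Find the region whose key range contains the given row key."""
    for r in regions:
        if r.start <= key and (r.end == b"" or key < r.end):
            return r
    raise KeyError(key)


def group_gets_by_server(regions: list, keys: list) -> dict:
    """Group point lookups so each server receives one batch of keys."""
    grouped = defaultdict(list)
    for key in keys:
        grouped[locate(regions, key).server].append(key)
    return dict(grouped)


REGIONS = [
    Region(b"", b"g", "rs1"),
    Region(b"g", b"n", "rs2"),
    Region(b"n", b"t", "rs1"),
    Region(b"t", b"", "rs3"),
]

if __name__ == "__main__":
    # Two scan ranges, as produced by splitting one predicate.
    print(prune(REGIONS, [(b"a", b"c"), (b"u", b"")]))
    print(group_gets_by_server(REGIONS, [b"apple", b"zebra", b"pear", b"kiwi"]))
```

Regions that no range touches never produce a task, and each surviving region's server is the natural preferred location for its partition.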
![Page 17: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/17.jpg)
![Page 18: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/18.jpg)
![Page 19: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/19.jpg)
BACK UP
![Page 20: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/20.jpg)
Kerberos Cluster
Kerberos Ticket
Token Retrieval and Renewal
Long Running Service
![Page 21: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/21.jpg)
FLOAT/DOUBLE: IEEE-754
Byte array order: 0.0, 0.2, …, MAX, -0.0, …, -2.0, …, MIN
WHERE X <= 2.0D: ranges [0.0, 2.0] and [-0.0, MIN]
WHERE X >= -2.0D: ranges [-0.0, -2.0] and [0.0, MAX]
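The floating-point case can be checked the same way. This sketch assumes doubles are stored as raw big-endian IEEE-754 bytes and compared as unsigned byte arrays, so the sign bit puts all negative values after the positive ones, and negative values appear in order of increasing magnitude. Infinities and NaN are ignored here, MIN is taken to mean the most negative finite double, and the helper names are hypothetical.

```python
import struct
import sys

DBL_MAX = sys.float_info.max


def encode_double(x: float) -> bytes:
    """Raw big-endian IEEE-754 encoding of a 64-bit double."""
    return struct.pack(">d", x)


# Sorting by the encoded bytes reproduces byte-array ordering.
values = [-DBL_MAX, -2.0, -0.0, 0.0, 0.2, 2.0, DBL_MAX]
byte_order = sorted(values, key=encode_double)


def byte_ranges_le(bound: float) -> list:
    """Inclusive (first, last) pairs, in byte order, covering WHERE X <= bound."""
    if bound >= 0:
        # Positive side up to the bound, then the whole negative side,
        # which runs from -0.0 to the most negative value in byte order.
        return [(0.0, bound), (-0.0, -DBL_MAX)]
    return [(bound, -DBL_MAX)]


def byte_ranges_ge(bound: float) -> list:
    """Inclusive (first, last) pairs, in byte order, covering WHERE X >= bound."""
    if bound < 0:
        # Negative side from -0.0 to the bound, then the whole positive side.
        return [(-0.0, bound), (0.0, DBL_MAX)]
    return [(bound, DBL_MAX)]


if __name__ == "__main__":
    print(byte_order)
    print(byte_ranges_le(2.0))
    print(byte_ranges_ge(-2.0))
```

Unlike the integer case, the negative side runs in reverse numeric order, so the bound of a negative range becomes its last key rather than its first.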
![Page 22: Spark + HBase](https://reader035.fdocuments.in/reader035/viewer/2022081511/587155581a28ab8e5b8b507d/html5/thumbnails/22.jpg)
HBase Meta Table