Jan 2013 HUG: Cloud-Friendly Hadoop and Hive
![Page 1: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/1.jpg)
Qubole Inc., Proprietary
Hadoop User Group
Ashish Thusoo
Jan 16, 2013
![Page 2: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/2.jpg)
About Me
• Big Data Veteran
• Ran the data infrastructure team at Facebook before starting Qubole
• Co-created Hive in 2007 @ Facebook
![Page 3: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/3.jpg)
What is Qubole?
• A comprehensive cloud data platform based on Hadoop and Hive for data in the cloud
  - Turnkey Infrastructure
  - Cloud Optimized Stack
  - Open Data Formats
• Useful for exploring data and creating batch processing applications/data pipelines
![Page 4: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/4.jpg)
Why Qubole?
End Users (User Ops, Product Managers, etc.)
Heterogeneous Data (Structured & Unstructured)
The Intermediaries (Data Scientists and Engineers)
BOTTLENECK
![Page 5: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/5.jpg)
Qubole Service
Cloud Data Service
Cloud Data Platform
Elastic . Robust . Fast
DataMarts
Explore Schedule SDK
EC2 / S3
Big Data Technology Stack
ODBC
Connectors
API
Logs
Events
DBs
Metrics
Cloud Sources
![Page 6: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/6.jpg)
Cloud vs Bare Metal
• Dynamic vs Fixed Provisioning
• Separation between Compute and Storage
• Purchasing and Budgeting
![Page 7: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/7.jpg)
Dynamic Provisioning
• Advantage: Transient Clusters
• Burden: How big of a cluster do I need?
• Solution: Auto-scaled Hadoop
![Page 8: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/8.jpg)
Challenges: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Adapting to Burstiness
  - Current load alone is not enough; future load must also be predicted
• Adapting Statefully
  - Removing HDFS nodes is risky without decommissioning
![Page 9: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/9.jpg)
Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• TaskTrackers report their launch times to the JobTracker
• The JobTracker (JT) computes the amount of time required to finish existing workloads
• If that time is above a certain threshold, more nodes are added
• At hourly boundaries, nodes are removed if there is insufficient work
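The scale-up and scale-down rules on this slide can be sketched roughly as follows. This is an illustrative sketch only, not Qubole's implementation: the `Node` class, the function names, the slot-based time estimate, the one-hour billing increment, and the five-minute window are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

HOUR = 3600  # assumption: nodes are billed in one-hour increments


@dataclass
class Node:
    name: str
    launch_time: float  # seconds, as reported by the node's TaskTracker


def estimated_finish_seconds(remaining_task_seconds, total_slots):
    """Rough time to drain the current workload at the current cluster size."""
    return remaining_task_seconds / max(total_slots, 1)


def nodes_to_add(remaining_task_seconds, slots_per_node, node_count, threshold_seconds):
    """Return how many nodes to add so the estimate falls to the threshold."""
    current = estimated_finish_seconds(
        remaining_task_seconds, node_count * slots_per_node
    )
    if current <= threshold_seconds:
        return 0  # the existing cluster finishes in time
    needed_slots = math.ceil(remaining_task_seconds / threshold_seconds)
    needed_nodes = math.ceil(needed_slots / slots_per_node)
    return max(needed_nodes - node_count, 0)


def near_hour_boundary(node, now, window_seconds=300):
    """True when the node is within window_seconds of completing a billed hour."""
    elapsed_in_hour = (now - node.launch_time) % HOUR
    return HOUR - elapsed_in_hour <= window_seconds
```

Removal is only considered for nodes where `near_hour_boundary` is true, which is why the launch times matter: releasing a node mid-hour gives up time that has already been paid for.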
![Page 10: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/10.jpg)
Implementation: Auto-scaled Hadoop
http://www.qubole.com/blog/index.php/first-auto-scaling-hadoop-hive-clusters/
• Restrictions on Deleting Nodes:
  - Nodes Containing Task Outputs of Current Jobs
  - Fast Decommissioning Done for Data Nodes
  - Minimum Cluster Size Threshold
• Fast Decommissioning is possible because HDFS is a cache for us
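One way to read the three restrictions above is as a filter over removal candidates. The sketch below is an assumption about how they combine, with hypothetical names throughout: a node is removable only if it holds no task output of a running job, has finished decommissioning, and removing it keeps the cluster at or above the minimum size.

```python
def removable_nodes(nodes, holds_task_output, decommissioned, min_cluster_size):
    """Return the nodes that may be deleted under the three restrictions.

    nodes: list of node names in the cluster
    holds_task_output: set of nodes storing task outputs of current jobs
    decommissioned: set of nodes whose DataNode has been decommissioned
    min_cluster_size: the cluster is never shrunk below this many nodes
    """
    # How many nodes can go before hitting the minimum cluster size.
    budget = max(len(nodes) - min_cluster_size, 0)
    candidates = [
        n for n in nodes
        if n not in holds_task_output and n in decommissioned
    ]
    return candidates[:budget]
```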
![Page 11: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/11.jpg)
Compute & Storage on the Cloud (EC2/S3)
• On the cloud, Compute and Storage are Separate!!
• Advantage: Don’t Pay for CPU for Storing Data
• Burden: Separation Can Cause Slowness & Variability
• Solutions:
  - Caching File System
  - Masking S3 Latency
![Page 12: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/12.jpg)
Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/
![Page 13: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/13.jpg)
Caching File System
http://www.qubole.com/blog/index.php/columnar-cloud-cache/
• Benefits:
  - Masks the performance variance associated with S3 while reading data
  - Columnar caching on the fly enables data to be persisted in open formats while still giving the benefits of performance
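The idea of columnar caching on the fly can be illustrated with a small read-through cache. This is a toy sketch, not the caching file system described in the talk: `ColumnarCache`, `fetch_rows`, and the in-memory dictionary are hypothetical stand-ins for a slow S3 read and a local cache, and the rows are assumed to be dictionaries that share the same keys.

```python
class ColumnarCache:
    """Read-through cache that stores row-oriented data column by column."""

    def __init__(self, fetch_rows):
        # fetch_rows is a hypothetical callable: path -> list of dict rows.
        # It stands in for a slow, variable-latency read from S3.
        self._fetch_rows = fetch_rows
        self._columns = {}  # (path, column name) -> list of values
        self.remote_reads = 0

    def read_column(self, path, column):
        key = (path, column)
        if key not in self._columns:
            self.remote_reads += 1
            rows = self._fetch_rows(path)
            # Convert the file to columns once and cache every column,
            # so later reads of other columns do not touch remote storage.
            names = rows[0].keys() if rows else ()
            for name in names:
                self._columns[(path, name)] = [row[name] for row in rows]
        return self._columns.get(key, [])
```

The source data stays in its original open format; only the cached copy is columnar, so a query that touches one column reads just that column on a cache hit.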
![Page 14: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/14.jpg)
Masking S3 Latency
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• File Operations in S3 are much slower than HDFS
• Problem: This leads to bad performance when data is distributed in a lot of files
• Solution:
  - Fast Split Generation Algorithm
  - Pipelined File Opens
![Page 15: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/15.jpg)
Faster Split Generation
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Directory operations with merging instead of per-file metadata (up to 8x speedup)
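To see why a directory-level listing helps, the sketch below compares per-file metadata lookups with a single prefix listing against a fake object store that counts requests. `FakeObjectStore` and every function name are hypothetical, and the code only illustrates the request-count difference, not the merging algorithm from the talk.

```python
class FakeObjectStore:
    """Stand-in for an object store; counts metadata requests."""

    def __init__(self, sizes):
        self.sizes = dict(sizes)  # key -> size in bytes
        self.requests = 0

    def head(self, key):
        """One request per file, like a per-file metadata lookup."""
        self.requests += 1
        return self.sizes[key]

    def list_prefix(self, prefix):
        """One request returns keys and sizes for everything under prefix."""
        self.requests += 1
        return sorted((k, s) for k, s in self.sizes.items() if k.startswith(prefix))


def make_splits(key, size, split_size):
    """Cut one file into (key, offset, length) splits."""
    return [(key, off, min(split_size, size - off)) for off in range(0, size, split_size)]


def splits_per_file(store, keys, split_size):
    splits = []
    for key in keys:
        splits.extend(make_splits(key, store.head(key), split_size))
    return splits


def splits_from_listing(store, prefix, split_size):
    splits = []
    for key, size in store.list_prefix(prefix):
        splits.extend(make_splits(key, size, split_size))
    return splits
```

Both functions produce the same splits; the listing variant does so with one request instead of one per file. A real S3 listing is paginated, so the saving is a large constant factor rather than a reduction to a single call.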
![Page 16: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/16.jpg)
Pipelined File Opens
http://www.qubole.com/blog/index.php/optimizing-hadoop-for-s3-part-1/
• Open S3 files before they are read (30% improvement in simple queries)
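A minimal sketch of the pipelining idea, assuming a reader that processes files one after another: while file i is being read, the open for file i+1 is already in flight on a background thread. The `open_file` and `read_file` callables are hypothetical placeholders for the S3 open and the record reader.

```python
from concurrent.futures import ThreadPoolExecutor


def read_all_pipelined(paths, open_file, read_file):
    """Yield the contents of each path, opening path i+1 while reading path i."""
    if not paths:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start opening the first file right away.
        pending = pool.submit(open_file, paths[0])
        for i in range(len(paths)):
            handle = pending.result()  # wait for the open that is in flight
            if i + 1 < len(paths):
                # Kick off the next open before reading the current file,
                # so the open latency overlaps with useful work.
                pending = pool.submit(open_file, paths[i + 1])
            yield read_file(handle)
```

The gain depends on how much of a task's time goes to opening files, which is why the effect is most visible on simple queries over many small files.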
![Page 17: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/17.jpg)
Purchasing Instances
• Buying Instances on Spot Prices vs On-Demand Prices
• Benefits: Cheaper on average by 50-60%
• Problems: Spot instances are not guaranteed and can be taken away anytime
  - Bad for MapReduce
  - Disastrous for HDFS
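As a rough illustration of the savings, the hourly cost of a cluster that mixes on-demand and spot nodes can be estimated as below. The function and its parameters are hypothetical, and the prices in the usage note are made-up inputs; only the 50-60% average discount comes from the slide.

```python
def blended_hourly_cost(node_count, on_demand_price, spot_fraction, spot_discount):
    """Estimate hourly cluster cost when a fraction of nodes run on spot.

    spot_fraction: share of nodes bought on the spot market (0 to 1)
    spot_discount: average spot saving relative to on-demand (0 to 1)
    """
    if not 0.0 <= spot_fraction <= 1.0 or not 0.0 <= spot_discount <= 1.0:
        raise ValueError("fractions must be between 0 and 1")
    spot_nodes = node_count * spot_fraction
    on_demand_nodes = node_count - spot_nodes
    spot_price = on_demand_price * (1.0 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price
```

For example, with 10 nodes at a hypothetical on-demand price of 1.00 per hour, half of them on spot at a 55% discount, the estimate is about 7.25 per hour instead of 10.00.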
![Page 18: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/18.jpg)
Spotted Hadoop Clusters
http://www.qubole.com/blog/index.php/hadoop-auto-scale-ec2-spot-instances/
• Simplified Spot Bidding Strategy
  - Configuring Bidding Timeouts
  - Configuring % of instances through spot
  - Configuring bid prices
• Spot Instance Aware HDFS Block Placement
  - Ensures One Replica of Each Block Resides on On-Demand Nodes
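The block placement rule can be sketched as a replica chooser that always pins the first replica to an on-demand node and fills the rest from any remaining node. This is an assumption-laden illustration, not the HDFS placement policy used in the talk: the function name, the default replication factor of 3, and the random selection are all choices made for the example.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Pick nodes for one block so at least one replica is on an on-demand node."""
    if not on_demand_nodes:
        raise ValueError("at least one on-demand node is required")
    # The first replica always lands on an on-demand node, so losing every
    # spot instance at once cannot lose the block.
    first = rng.choice(on_demand_nodes)
    others = [n for n in on_demand_nodes + spot_nodes if n != first]
    extra = rng.sample(others, min(replication - 1, len(others)))
    return [first] + extra
```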
![Page 19: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/19.jpg)
Conclusion
• Cloud is Different from Bare Metal
• Check out more optimizations that we have made to run Hadoop and Hive optimally in the cloud at our blog
http://www.qubole.com/blog/
![Page 20: Jan 2013 HUG: Cloud-Friendly Hadoop and Hive](https://reader031.fdocuments.in/reader031/viewer/2022020717/558952e8d8b42ae40b8b4647/html5/thumbnails/20.jpg)
Thank you.
Free Sign up for Qubole at https://api.qubole.com/users/sign_up
Careers at http://www.qubole.com/careers