Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon: An Open Source Memory-Centric Distributed Storage System
Gene Pang, Tachyon [email protected]
October 29, 2015 @ Spark Summit Europe
• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source Project
• www.tachyonnexus.com
Outline
• Introduction to Tachyon
• Using Spark with Tachyon
• New Tachyon Features
• Getting Involved
Introduction to Tachyon
History of Tachyon
• Started at UC Berkeley AMPLab
  – From Summer 2012
  – Same lab produced Apache Spark and Apache Mesos
• Open sourced in April 2013
  – Apache License 2.0
  – Latest Release: Version 0.8.0 (October 2015)
• Deployed at > 100 companies
Contributors Growth
• v0.1 (Dec '12): 1
• v0.2 (Apr '13): 3
• v0.3 (Oct '13): 15
• v0.4 (Feb '14): 30
• v0.5 (Jul '14): 46
• v0.6 (Mar '15): 70
• v0.7 (Jul '15): 111
Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality is important!
Using Spark with Tachyon
Issue 1
Data sharing bottleneck in analytics pipeline: slow writes to disk
[Diagram 1: Spark Job 1 and Spark Job 2, each with its own Spark memory holding blocks 1 and 3, exchanging data through HDFS / Amazon S3, which holds blocks 1 to 4. Label: storage engine & execution engine, same process.]
[Diagram 2: a Spark job (Spark memory holding blocks 1 and 3) and a Hadoop MR job on YARN, exchanging data through HDFS / Amazon S3, which holds blocks 1 to 4. Label: storage engine & execution engine, same process.]
Issue 1 resolved with Tachyon
Memory-speed data sharing among different jobs and different frameworks
[Diagram: a Spark job (Spark mem) and a Hadoop MR job on YARN share blocks 1, 3 and 4 held in memory by Tachyon; HDFS / Amazon S3 on disk holds blocks 1 to 4. Label: storage engine & execution engine, same process.]
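The data sharing on this slide can be sketched from the Spark side as two jobs that exchange a dataset through a tachyon:// path instead of a disk-backed store. This is a minimal sketch, not the speaker's code: it assumes a Tachyon master reachable at localhost:19998, a Spark installation with the Tachyon client on its classpath, and an existing SparkContext passed in as sc. The helper names and the example path are hypothetical.

```python
# Sketch only. Assumptions: Tachyon master at localhost:19998, Tachyon client
# on Spark's classpath, and `sc` is an existing pyspark SparkContext.
# tachyon_uri, write_shared and read_shared are hypothetical helper names.

def tachyon_uri(path, host="localhost", port=19998):
    """Build a tachyon:// URI for an absolute Tachyon path."""
    if not path.startswith("/"):
        raise ValueError("Tachyon paths must be absolute: %r" % path)
    return "tachyon://%s:%d%s" % (host, port, path)


def write_shared(sc, records, path):
    """Job 1: write records to Tachyon so that other jobs can read them."""
    uri = tachyon_uri(path)
    # saveAsTextFile accepts any Hadoop-compatible file system URI.
    sc.parallelize(records).saveAsTextFile(uri)
    return uri


def read_shared(sc, path):
    """Job 2, possibly a different application: read the same data back."""
    return sc.textFile(tachyon_uri(path))
```

A second framework, such as a Hadoop MR job, would read the same tachyon:// URI through its own file system client; only the URI has to be shared between the jobs.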
Issue 2
In-memory data loss when computation crashes
[Diagram, three steps: (1) a Spark task whose Spark memory block manager holds blocks 1 and 3, with HDFS / Amazon S3 holding blocks 1 to 4; (2) the task crashes; (3) after the crash only the blocks in HDFS / Amazon S3 remain. Label: storage engine & execution engine, same process.]
Issue 2 resolved with Tachyon
Keep in-memory data safe, even when computation crashes
[Diagram, two steps: (1) a Spark task with its Spark memory block manager, Tachyon holding blocks 1, 3 and 4 in memory, and HDFS / Amazon S3 on disk holding blocks 1 to 4; (2) the task crashes, while the blocks in Tachyon and in HDFS / Amazon S3 remain. Label: storage engine & execution engine, same process.]
Issue 3
In-memory data duplication & Java garbage collection
[Diagram: Spark Job 1 and Spark Job 2 each hold their own copies of blocks 1 and 3 in Spark memory; shared storage holds blocks 1 to 4. Label: storage engine & execution engine, same process.]
Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
[Diagram: Spark Job 1 and Spark Job 2 (each with Spark mem) share a single in-memory copy of blocks 1, 3 and 4 held by Tachyon; HDFS / Amazon S3 on disk holds blocks 1 to 4. Label: storage engine & execution engine, same process.]
Tachyon Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Tachyon Storage Media: MEM + HDD
• 100+ Tachyon nodes
• 1PB+ Tachyon managed storage
• 30x Performance Improvement
Tachyon Use Case: An Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Tachyon Storage Media: MEM only
• Analyzing data in traditional storage
Tachyon Use Case: A SAAS Company
• Framework: Spark
• Under Storage: S3
• Tachyon Storage Media: SSD only
• Elastic Tachyon deployment
New Tachyon Features
4. Transparent Naming
• Persisted Tachyon files are mapped to under storage
• Tachyon paths are preserved in under storage
[Diagram: the Tachyon namespace at tachyon://host:port/ contains Data (Reports, Sales) and Users (Alice, Bob); the under storage system (HDFS, S3, …) at s3n://bucket/directory/ contains the same Data (Reports, Sales) and Users (Alice, Bob) tree.]
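The path mapping behind transparent naming can be illustrated with a small function that keeps the Tachyon path unchanged underneath the under storage root. This is an illustrative sketch only, not Tachyon's implementation; the function name under_storage_path is hypothetical, and the URIs are the placeholders shown in the slide.

```python
import posixpath

# Sketch only: a hypothetical helper, not part of Tachyon's API.
def under_storage_path(tachyon_path, under_root):
    """Map an absolute Tachyon path to the same relative path in under storage."""
    if not tachyon_path.startswith("/"):
        raise ValueError("Tachyon paths must be absolute: %r" % tachyon_path)
    # Normalize "." and ".." segments, then drop the leading slash(es).
    relative = posixpath.normpath(tachyon_path).lstrip("/")
    root = under_root.rstrip("/")
    return root + "/" + relative if relative else root + "/"


# A persisted Tachyon file keeps its path below the under storage root.
print(under_storage_path("/Data/Reports", "s3n://bucket/directory/"))
```

Because the relative path is preserved, a file persisted from Tachyon can be located in the under storage system directly, without consulting Tachyon.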
5. Unified Namespace
• Unified namespace for multiple storage systems
• Share data across storage systems
• On-the-fly mounting/unmounting
[Diagram: the Tachyon namespace at tachyon://host:port/ contains Data (Reports, Sales) and Users (Alice, Bob). Users (Alice, Bob) comes from Storage System A at hdfs://host:port/; Reports and Sales come from Storage System B at s3n://bucket/directory/.]
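One way to picture a unified namespace is a mount table that resolves each Tachyon path to the storage system mounted at its longest matching prefix. The toy model below is an assumption-laden sketch, not Tachyon's implementation or API: the MountTable class and its methods are hypothetical names, and the URIs are the placeholders from the slide.

```python
# Sketch only: a toy mount table, not Tachyon's actual data structure or API.
class MountTable:
    def __init__(self, root_ufs):
        # The root of the namespace is always backed by one storage system.
        self._mounts = {"/": root_ufs.rstrip("/")}

    @staticmethod
    def _normalize(path):
        if not path.startswith("/"):
            raise ValueError("Tachyon paths must be absolute: %r" % path)
        return "/" + path.strip("/")

    def mount(self, tachyon_dir, ufs_uri):
        """Attach another storage system at a Tachyon directory."""
        point = self._normalize(tachyon_dir)
        if point in self._mounts:
            raise ValueError("already mounted: %s" % point)
        self._mounts[point] = ufs_uri.rstrip("/")

    def unmount(self, tachyon_dir):
        """Detach a mount point; the root cannot be removed."""
        point = self._normalize(tachyon_dir)
        if point == "/":
            raise ValueError("cannot unmount the root")
        del self._mounts[point]  # raises KeyError if nothing is mounted there

    def resolve(self, tachyon_path):
        """Return the under storage URI for a Tachyon path (longest prefix wins)."""
        path = self._normalize(tachyon_path)
        best = "/"
        for point in self._mounts:
            inside = path == point or path.startswith(point + "/")
            if point != "/" and inside and len(point) > len(best):
                best = point
        remainder = path[len(best):].lstrip("/")
        base = self._mounts[best]
        return base + "/" + remainder if remainder else base


table = MountTable("hdfs://host:port/")
table.mount("/Data", "s3n://bucket/directory/")
print(table.resolve("/Users/Alice"))   # served by Storage System A
print(table.resolve("/Data/Reports"))  # served by Storage System B
```

Mounting and unmounting only change the table, so in this model applications keep using the same tachyon:// paths while the backing storage system changes underneath.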
Additional Features
• Remote Write Support
• Easy deployment with Mesos and Yarn
• Initial Security Support
• One Command Cluster Deployment
• Metrics for Clients/Workers/Master
Getting Involved
Try Tachyon: http://tachyon-project.org
Develop Tachyon: https://github.com/amplab/tachyon
Meet Friends: http://www.meetup.com/Tachyon
Tachyon Nexus: http://www.tachyonnexus.com
Email: [email protected]
Thank you!