Using Spark with Tachyon by Gene Pang

Using Spark with Tachyon: An Open Source Memory-Centric Distributed Storage System

Gene Pang, Tachyon [email protected]

October 29, 2015 @ Spark Summit Europe

Who Am I?

• Gene Pang

• PhD from UC Berkeley AMPLab

• Software Engineer at Tachyon Nexus

• Team consists of Tachyon creators, top contributors

• Series A ($7.5 million) from Andreessen Horowitz

• Committed to Tachyon Open Source Project

• www.tachyonnexus.com

Outline

• Introduction to Tachyon

• Using Spark with Tachyon

• New Tachyon Features

• Getting Involved

History of Tachyon

• Started at UC Berkeley AMPLab

– From Summer 2012

– Same lab produced Apache Spark and Apache Mesos

• Open sourced in April 2013

– Apache License 2.0

– Latest Release: Version 0.8.0 (October 2015)

• Deployed at > 100 companies

Contributors Growth

[Chart: number of contributors at each release]

• v0.1 (Dec '12): 1

• v0.2 (Apr '13): 3

• v0.3 (Oct '13): 15

• v0.4 (Feb '14): 30

• v0.5 (Jul '14): 46

• v0.6 (Mar '15): 70

• v0.7 (Jul '15): 111

Contributors Growth

150+ Contributors

50+ Organizations

One of the Fastest Growing Big Data Open Source Projects

Thanks to Contributors and Users!

Reported Tachyon Usage

What is Tachyon?

Open Source Memory-Centric Distributed Storage System

Tachyon Stack

Why Use Tachyon?

Performance Trend: Memory is Fast

• RAM throughput increasing exponentially

• Disk throughput increasing slowly

Memory-locality is important!

Price Trend: Memory is Cheaper

source: jcmit.com

These Memory Trends are Realized By Many…

Is the Problem Solved?

Missing a Solution for the Storage Layer

Tachyon enables reliable data sharing at memory-speed within and across computation frameworks/jobs

How Does Tachyon Work?

Memory-Centric Storage Architecture

Lineage in Storage Layer

Tachyon Memory-Centric Architecture

Lineage in Tachyon

Apache Spark: Fast and general engine for large-scale data processing

What are some potential issues?

Issue 1

Data sharing bottleneck in analytics pipeline: slow writes to disk

[Diagram 1: Spark Job1 and Spark Job2 each cache blocks 1 and 3 in their own Spark memory and share data through HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process.]

[Diagram 2: the same bottleneck between a Spark job and a Hadoop MR job on YARN.]


Issue 1 resolved with Tachyon

Memory-speed data sharing among different jobs and different frameworks

[Diagram: a Spark job and a Hadoop MR job on YARN share blocks held in Tachyon (in-memory), with HDFS (disk) underneath holding blocks 1 to 4.]


Issue 2

In-memory data loss when computation crashes

[Diagram: a Spark task caches blocks 1 and 3 in Spark memory (block manager), backed by HDFS / Amazon S3 (blocks 1 to 4). Storage engine and execution engine run in the same process, so when the task crashes the cached blocks are lost.]

Issue 2 resolved with Tachyon

Keep in-memory data safe, even when computation crashes

[Diagram: the Spark task reads blocks held in Tachyon (in-memory), backed by HDFS / Amazon S3 (disk, blocks 1 to 4). When the task crashes, the blocks in Tachyon remain available.]

Issue 3

In-memory data duplication and Java garbage collection

[Diagram: Spark Job1 and Spark Job2 each cache their own copies of blocks 1 and 3 in Spark memory, on top of storage holding blocks 1 to 4.]


Issue 3 resolved with Tachyon

No in-memory data duplication, much less GC

[Diagram: Spark Job1 and Spark Job2 share a single in-memory copy of the blocks held in Tachyon, with HDFS (disk) underneath.]
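The duplication point can be made concrete with a little arithmetic. The sketch below is illustrative only: the 128 MB block size and the function name are assumptions, not figures from the talk. It compares the memory needed when every job caches its own copy of a block against one shared copy per distinct block.

```python
# Hypothetical block size, chosen only for illustration.
BLOCK_SIZE_MB = 128


def cached_memory_mb(job_caches, shared):
    """Return the memory (in MB) needed to hold the jobs' cached blocks.

    job_caches: list of sets of block ids, one set per job.
    shared: False -> every job keeps its own copy in its own heap;
            True  -> one copy of each distinct block lives in a shared store.
    """
    if shared:
        distinct = set().union(*job_caches)
        return len(distinct) * BLOCK_SIZE_MB
    return sum(len(blocks) for blocks in job_caches) * BLOCK_SIZE_MB


# Block layout from the diagram: both jobs cache blocks 1 and 3.
jobs = [{1, 3}, {3, 1}]
print(cached_memory_mb(jobs, shared=False))  # 512
print(cached_memory_mb(jobs, shared=True))   # 256
```

With more jobs reading the same blocks, the per-job figure grows linearly while the shared figure stays flat. Data held outside the JVM heap is also not scanned by the Java garbage collector, which is the second half of the slide's claim.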


Tachyon Use Case: Baidu

• Framework: SparkSQL

• Under Storage: Baidu’s File System

• Tachyon Storage Media: MEM + HDD

• 100+ Tachyon nodes

• 1PB+ Tachyon managed storage

• 30x Performance Improvement

Tachyon Use Case: An Oil Company

• Framework: Spark

• Under Storage: GlusterFS

• Tachyon Storage Media: MEM only

• Analyzing data in traditional storage

Tachyon Use Case: A SaaS Company

• Framework: Spark

• Under Storage: S3

• Tachyon Storage Media: SSD only

• Elastic Tachyon deployment

Tachyon 0.8.0 Just Released!

http://tachyon-project.org/

1. Growing Ecosystem

Use different frameworks to enable workloads on different storage

2. Tiered Storage

Tachyon manages more than DRAM

[Diagram: storage tiers MEM, SSD, HDD. Upper tiers are faster; lower tiers have greater capacity.]

Configurable storage tiers (e.g. MEM only, MEM + HDD, SSD only)

3. Pluggable Data Management Policy

Evict stale data to lower tier

Promote hot data to upper tier
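The evict/promote behaviour described above can be sketched in a few lines. This is a toy model, not Tachyon's implementation: the class name, the per-tier capacities in blocks, and the choice of LRU as the eviction policy are all assumptions made for illustration (the slide says only that the policy is pluggable).

```python
from collections import OrderedDict


class TieredStore:
    """Toy multi-tier block store: LRU eviction downward, promotion on read."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity_in_blocks), fastest tier first,
        # e.g. [("MEM", 2), ("SSD", 2), ("HDD", 4)].
        self.names = [name for name, _ in tiers]
        self.capacity = [cap for _, cap in tiers]
        # One OrderedDict per tier; order tracks recency (oldest first).
        self.blocks = [OrderedDict() for _ in tiers]

    def _insert(self, level, block_id, data):
        if level >= len(self.blocks):
            return  # fell off the lowest tier: the block leaves the store
        tier = self.blocks[level]
        tier[block_id] = data
        tier.move_to_end(block_id)
        # Evict least recently used blocks one tier down until this tier fits.
        while len(tier) > self.capacity[level]:
            old_id, old_data = tier.popitem(last=False)
            self._insert(level + 1, old_id, old_data)

    def write(self, block_id, data):
        for tier in self.blocks:
            tier.pop(block_id, None)  # drop any older copy
        self._insert(0, block_id, data)

    def read(self, block_id):
        for tier in self.blocks:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)  # promote hot data to the top
                return data
        raise KeyError(block_id)

    def tier_of(self, block_id):
        for name, tier in zip(self.names, self.blocks):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("HDD", 4)])
for i in (1, 2, 3):
    store.write(i, "data-%d" % i)
print(store.tier_of(1))  # HDD (stale block evicted to the lower tier)
store.read(1)
print(store.tier_of(1))  # MEM (hot block promoted to the upper tier)
print(store.tier_of(2))  # HDD (evicted to make room)
```

Swapping the eviction rule inside `_insert` for another policy is the kind of change a pluggable data management policy is meant to allow.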

4. Transparent Naming

[Diagram: the Tachyon namespace at tachyon://host:port/ (Data: Reports, Sales; Users: Alice, Bob) is mirrored in the under storage system (HDFS, S3, …) at s3n://bucket/directory/ with the same Data and Users directories.]

• Persisted Tachyon files are mapped to under storage

• Tachyon paths are preserved in under storage
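A minimal sketch of the path mapping these bullets describe, assuming the under-storage root is simply prefixed to the Tachyon path. The function name, the host `master`, and the port `19998` are placeholders for illustration; this is not Tachyon's API.

```python
from urllib.parse import urlparse


def to_under_storage_uri(tachyon_uri, under_root):
    """Map a tachyon:// URI to the under-storage URI with the same relative path."""
    parsed = urlparse(tachyon_uri)
    if parsed.scheme != "tachyon":
        raise ValueError("expected a tachyon:// URI: %r" % (tachyon_uri,))
    relative = parsed.path.lstrip("/")
    # The directory layout is preserved: only the root changes.
    return under_root.rstrip("/") + "/" + relative


print(to_under_storage_uri("tachyon://master:19998/Data/Reports",
                           "s3n://bucket/directory/"))
# s3n://bucket/directory/Data/Reports
```

Because the relative path is unchanged, a file persisted through Tachyon can be found in the under storage by anyone who knows its Tachyon path, without consulting Tachyon.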

5. Unified Namespace

[Diagram: tachyon://host:port/ exposes Data (Reports, Sales) and Users (Alice, Bob). Users (Alice, Bob) comes from Storage System A at hdfs://host:port/; Reports and Sales come from Storage System B at s3n://bucket/directory/.]

• Unified namespace for multiple storage systems

• Share data across storage systems

• On-the-fly mounting/unmounting
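One way to picture a unified namespace is a mount table that maps Tachyon path prefixes to locations in different storage systems. The sketch below is a hypothetical illustration, not Tachyon's API: the `MountTable` class, its method names, and the longest-prefix resolution rule are assumptions. The URIs are the placeholders from the diagram.

```python
class MountTable:
    """Toy mount table: Tachyon path prefix -> under-storage URI."""

    def __init__(self):
        self._mounts = {}

    @staticmethod
    def _normalize(path):
        return "/" + path.strip("/")

    def mount(self, tachyon_path, under_uri):
        key = self._normalize(tachyon_path)
        if key in self._mounts:
            raise ValueError("already mounted: " + key)
        self._mounts[key] = under_uri.rstrip("/")

    def unmount(self, tachyon_path):
        del self._mounts[self._normalize(tachyon_path)]

    def resolve(self, tachyon_path):
        """Translate a Tachyon path into the URI of the backing storage system."""
        path = self._normalize(tachyon_path)
        # The longest matching mount point wins.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            base = prefix.rstrip("/")
            if path == prefix or path.startswith(base + "/"):
                return self._mounts[prefix] + path[len(base):]
        raise KeyError("no mount covers " + path)


table = MountTable()
table.mount("/Users", "hdfs://host:port/Users")   # Storage System A
table.mount("/Data", "s3n://bucket/directory/")   # Storage System B
print(table.resolve("/Users/Alice"))   # hdfs://host:port/Users/Alice
print(table.resolve("/Data/Reports"))  # s3n://bucket/directory/Reports
table.unmount("/Data")                 # on-the-fly unmounting
```

Applications see a single tree under the Tachyon root, while each subtree can live in a different storage system and be attached or detached without restarting anything in this model.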

Additional Features

Remote Write Support

Easy deployment with Mesos and YARN

Initial Security Support

One Command Cluster Deployment

Metrics for Clients/Workers/Master

Welcome users and collaborators!

Memory-Centric Distributed Storage System

Try Tachyon: http://tachyon-project.org

Develop Tachyon: https://github.com/amplab/tachyon

Meet Friends: http://www.meetup.com/Tachyon

Tachyon Nexus: http://www.tachyonnexus.com

Email: [email protected]

Thank you!
