Tachyon meetup slides.

Tachyon Meetup
By David Gruzman, BigDataCraft.com

Hosted by Verint Israel
http://www.verint.com/

About Us
● We are a young startup building the ImpalaToGo project (http://impala2go.info/).
● ImpalaToGo is a SQL engine based on Cloudera Impala.
● Our focus is providing a relational view of nested data.

Relational views
● Show nested data, such as JSON, as if it were already normalized into separate tables.
● We provide it as a service:
  o You put data into S3.
  o We provide you with a SQL interface to the relational representation of your nested data.
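To make the idea of a relational view concrete, here is a minimal Python sketch that splits nested JSON records into a parent table and a child table linked by a foreign key. The function name `normalize`, the table names and the sample record are all hypothetical; this only illustrates the general technique and is not how ImpalaToGo implements it.

```python
import json


def normalize(records, root="orders"):
    """Split nested records into flat tables (hypothetical helper).

    Assumes every nested list holds flat dictionaries.
    """
    tables = {root: []}
    for row_id, record in enumerate(records):
        flat = {"id": row_id}
        for key, value in record.items():
            if isinstance(value, list):
                # Each nested list becomes a child table with a foreign key.
                child = tables.setdefault(f"{root}_{key}", [])
                for item in value:
                    child.append({f"{root}_id": row_id, **item})
            else:
                flat[key] = value
        tables[root].append(flat)
    return tables


docs = [json.loads(
    '{"customer": "acme", "items": [{"sku": "a1", "qty": 2}, {"sku": "b7", "qty": 1}]}'
)]
tables = normalize(docs)
```

After this runs, `tables["orders"]` holds one row per document and `tables["orders_items"]` holds one row per nested item, so the two can be joined on `orders_id` like ordinary SQL tables.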

Tachyon
http://tachyonproject.org/

By David Gruzman
http://BigDatacraft.com
http://impala2go.info

Tachyon - What is it?
● Tachyon is a reliable, memory-centric distributed storage system.
● An advanced distributed caching system for big data.
● An efficient data sharing mechanism between big data frameworks.

What problem does Tachyon solve?

● Big data storage is mostly HDD based, with performance (scan rate and latency) implications.

● When data is written to HDFS, it is replicated synchronously, which is slow.

● If data replication is performed asynchronously, there is a risk of losing data.

Alternative to replication
● The idea is to remember how a data chunk was created, and to recreate it if required.
● This concept is called information lineage.
● Besides Tachyon, it is also used by Spark.
● Only functional systems can use it.
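The lineage idea can be sketched in a few lines of Python. The class `LineageStore` and its methods are hypothetical names used only for illustration; they are not Tachyon's or Spark's API. The sketch assumes the original inputs are stored safely elsewhere and that every transformation is deterministic, which is why only functional systems can rely on lineage.

```python
class LineageStore:
    """Toy in-memory store that recomputes lost data from recorded lineage."""

    def __init__(self, durable_inputs):
        self._durable = dict(durable_inputs)  # assumed to be safely stored elsewhere
        self._memory = {}                     # derived data, may be lost
        self._lineage = {}                    # name -> (function, input names)

    def derive(self, name, func, *inputs):
        # Remember how the data was produced, then compute it once.
        self._lineage[name] = (func, inputs)
        self._memory[name] = func(*(self.get(i) for i in inputs))

    def lose(self, name):
        # Simulate a node failure that drops an in-memory chunk.
        self._memory.pop(name, None)

    def get(self, name):
        if name in self._durable:
            return self._durable[name]
        if name not in self._memory:
            # Recreate the chunk by replaying its lineage (recursively).
            func, inputs = self._lineage[name]
            self._memory[name] = func(*(self.get(i) for i in inputs))
        return self._memory[name]


store = LineageStore({"raw": [3, 1, 2]})
store.derive("sorted", sorted, "raw")
store.derive("doubled", lambda xs: [x * 2 for x in xs], "sorted")
store.lose("sorted")
store.lose("doubled")
result = store.get("doubled")  # recomputed from "raw" through both steps
```

No replica of "sorted" or "doubled" was ever written; both are rebuilt on demand from the durable input.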

Lineage - practical note
Lineage cannot be implemented inside the storage layer alone, since the storage layer is not aware of how the data is produced. It requires integration between the code execution framework and the storage layer. Tachyon can be integrated with frameworks such as MR and Spark to support lineage.

Lineage - optimizations
● Deciding when to checkpoint and when to rely on possible recomputation leaves a lot of room for optimization.
● Many parameters need to be taken into account:
  o Cost of storage
  o Cost and time of recomputation
  o Probability of failures
  o etc.
● Bottom line: it’s a hard optimization problem.
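As a deliberately simplified illustration of the trade-off, the sketch below compares the cost of storing a checkpoint against the expected cost of recomputation. The function name, the single-failure model and the numbers are all assumptions made for the example; a real optimizer would have to consider many more factors.

```python
def should_checkpoint(storage_cost, recompute_cost, failure_probability):
    """Checkpoint only if storing is cheaper than the expected recomputation.

    Simplified model (an assumption): at most one failure, and all costs
    are expressed in the same unit.
    """
    expected_recompute_cost = failure_probability * recompute_cost
    return storage_cost < expected_recompute_cost


cheap_to_store = should_checkpoint(1.0, 100.0, 0.05)       # 1 < 5
expensive_to_store = should_checkpoint(10.0, 100.0, 0.05)  # 10 < 5 is false
```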

Data sharing
● Fast reads and writes to the DFS allow simple and efficient data sharing among systems.
● For example, Impala and Spark clusters working together: Spark produces data and Impala queries it, over and over.
● There is little sense in persisting temporary data and paying the price of replication.

Data caching
● It is very convenient to work with cloud storage as a DFS.
  o However, access is slow and has high latency.
● An efficient and scalable caching layer provides a significant performance gain.

Tachyon API
Tachyon provides two APIs:
1. Native Java file-like API.
2. DFS: Hadoop file system API with true data locality support.

Tachyon architecture

Concept of UnderFS
● Tachyon relies on some other storage system to safely store the data.
● Supported file systems:
  o HDFS
  o S3
  o Swift
  o GlusterFS

Checkpoints
● Tachyon checkpoints data to the underlying FS to make sure it is safe.
● This happens synchronously or asynchronously, depending on the type of caching.
● “Write through” types of caching work synchronously.

Intermediate summary
Tachyon is a kind of distributed file system. Its main differences from others are:
- RAM is its primary storage.
- There is an UnderFS.
- Data can be recomputed using lineage.

Tachyon as tiered storage

RAM

SSD

HDD

The capability to work with a storage hierarchy allows more efficient use of the available RAM and SSD.

Tiered storage configuration

● Eviction policy: only LRU for now.
● Directories and their sizes are configured for each tier separately.
● When a storage tier becomes full, data is spilled to the next level according to the eviction policy.
● When data is evicted from the lowest tier, it is written to the UnderFS.
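The spilling behaviour described above can be modelled with a short Python sketch. The class `TieredStore`, its methods and the block-count capacities are hypothetical; real tiers are sized in bytes and configured through Tachyon's own settings, which are not shown here.

```python
from collections import OrderedDict


class TieredStore:
    """Toy model: LRU eviction per tier, spilling down to an UnderFS dict."""

    def __init__(self, tier_capacities):
        # tier_capacities: fastest tier first, e.g. [("RAM", 2), ("SSD", 1)].
        # Capacity is counted in blocks to keep the example small.
        self.tiers = [(name, cap, OrderedDict()) for name, cap in tier_capacities]
        self.under_fs = {}

    def put(self, key, value):
        # Drop any stale copy, then write into the top tier.
        for _, _, blocks in self.tiers:
            blocks.pop(key, None)
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        if level == len(self.tiers):
            # Evicted from the lowest tier: write to the UnderFS.
            self.under_fs[key] = value
            return
        _, capacity, blocks = self.tiers[level]
        blocks[key] = value
        if len(blocks) > capacity:
            # Spill the least recently used block to the next tier.
            old_key, old_value = blocks.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def get(self, key):
        for _, _, blocks in self.tiers:
            if key in blocks:
                blocks.move_to_end(key)  # mark as recently used
                return blocks[key]
        return self.under_fs.get(key)

    def locate(self, key):
        for name, _, blocks in self.tiers:
            if key in blocks:
                return name
        return "UnderFS" if key in self.under_fs else None


store = TieredStore([("RAM", 2), ("SSD", 1)])
for block in ["a", "b", "c", "d"]:
    store.put(block, block.upper())
locations = {block: store.locate(block) for block in ["a", "b", "c", "d"]}
```

With two RAM slots and one SSD slot, the oldest block ends up in the UnderFS, the next oldest on SSD, and the two newest stay in RAM.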

Implementation highlights
● A RAM drive is used for in-memory storage, to overcome GC limitations.
● Files are divided into blocks.
● Master nodes share the image and redo logs.
● The communication framework is Netty or NIO (configurable).

Tachyon as a cache
Tachyon can also be viewed as a caching layer, which makes access to the UnderFS faster.

DFS

Tachyon

Application / framework

Caching modes
MUST_CACHE - Tachyon will cache data during write.
TRY_CACHE - Tachyon will try to cache.
CACHE_THROUGH - try to cache, write to UnderFS synchronously.
THROUGH - write to UnderFS without any caching.
ASYNC_THROUGH - must cache and write to UnderFS asynchronously, or write to UnderFS synchronously.
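The differences between the modes can be summarised in a small Python model. This is only a sketch of the semantics listed above, not Tachyon's Java API: the `write` function, the `pending` queue and the `cache_has_room` flag are hypothetical, raising an error when a mandatory cache write fails is an assumption, and only the asynchronous half of ASYNC_THROUGH is modelled.

```python
from enum import Enum


class WriteMode(Enum):
    MUST_CACHE = "MUST_CACHE"
    TRY_CACHE = "TRY_CACHE"
    CACHE_THROUGH = "CACHE_THROUGH"
    THROUGH = "THROUGH"
    ASYNC_THROUGH = "ASYNC_THROUGH"


def write(mode, key, value, cache, under_fs, pending, cache_has_room=True):
    """Apply one write; `pending` collects keys awaiting an async UnderFS write."""
    if mode is not WriteMode.THROUGH:
        if cache_has_room:
            cache[key] = value
        elif mode in (WriteMode.MUST_CACHE, WriteMode.ASYNC_THROUGH):
            # Assumption: a mandatory cache write fails loudly when there is no room.
            raise IOError("no room in cache")
    if mode in (WriteMode.CACHE_THROUGH, WriteMode.THROUGH):
        under_fs[key] = value  # synchronous write to the UnderFS
    elif mode is WriteMode.ASYNC_THROUGH:
        pending.append(key)    # written to the UnderFS later, in the background


cache, under_fs, pending = {}, {}, []
write(WriteMode.CACHE_THROUGH, "a", 1, cache, under_fs, pending)
write(WriteMode.THROUGH, "b", 2, cache, under_fs, pending)
write(WriteMode.TRY_CACHE, "c", 3, cache, under_fs, pending, cache_has_room=False)
write(WriteMode.ASYNC_THROUGH, "d", 4, cache, under_fs, pending)
```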

HDFS in-memory caching
1. Specify which files or directories will be cached.
2. Configure how many replicas are cached.
3. Assign files and directories to memory pools.
4. Off-heap caching is used.

HDFS Caching

HDFS Caching benefits
1. Manual selection of what should be cached.
2. Control over the number of replicas to be cached. It prevents popular data from being cached on all nodes.

Comparison with HDFS

Let's see what HDFS and Tachyon have in common, and what the differences are.

What is in common
● Both are distributed storage systems.
● Both use disk and RAM to store data.
● Both provide the Hadoop file system interface (DFS).
● Both store metadata in a centralized way.
● Both are written in Java.

What is the difference
● HDFS is the “main storage”; it does not rely on any other system.
● Tachyon is a proxy with the notion of an “Under FS”, where it can save data and assume it is safe.

Difference - continued
● HDFS's main storage is disk, while memory is optional for some files. Data is always persisted on disk.
● Tachyon's main storage is RAM, and it is capable of spilling data to an HDD/SSD storage hierarchy. In other words, it is tiered storage. Disk may be absent.

Tachyon vs HDFS summary
● When HDFS is used as primary data storage, the HDFS caching capability solves part of the performance problems.
● Caching is the main feature; data sharing is less addressed and is achieved as a by-product.
● In virtualized environments, where remote shared storage (like S3 or Swift) is available, Tachyon can completely replace HDFS.

Tachyon use cases
Systems can use Tachyon in two different ways.
● As in-memory storage for serialized blocks, or off-heap storage.
● As a caching layer to get predictable performance when working with S3.

Tachyon use cases

Spark Node

Tachyon - stored blocks of serialized RDD

DFS

Memory storage for serialized blocks

Tachyon use cases

S3

Tachyon

ImpalaToGo, Spark, MapReduce

Caching layer for predictable performance

Tachyon and ImpalaToGo case

● ImpalaToGo is optimized for cloud usage.
● ImpalaToGo is integrated with Tachyon and uses it as a caching layer.
● Currently ImpalaToGo has both its own native caching layer and Tachyon.
● We see Tachyon as a strategic choice.

Tachyon's value in cloud
● When S3 is used as a DFS, speed is low and not predictable.
● Tachyon helps here a lot, by utilizing RAM and local drives to cache S3 data.
● If the cluster is elastic or transient, Tachyon may serve as an HDFS replacement.

Tachyon in cloud - cont
● Tachyon handles missing nodes by trying to recompute the missing data.
● There is a separate scheduler responsible for recomputation tasks.

Tachyon Baidu case

Main storage in separate data center

Spark, Tachyon

View manager

Tachyon's Baidu case
- Data is stored in a different data center.
- Tachyon is used as a look-aside cache. The next step will be “transparent mode”, where Tachyon will sit between the storage and Spark.
- A speedup of about 30x is achieved.
- Details:

https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F14452042%2FTachyon_Meetup_2015_5_28-1-Baidu.pdf
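For readers unfamiliar with the term, a look-aside cache means that the reader checks the cache first and goes to the remote storage itself only on a miss, then populates the cache. The Python sketch below shows the general pattern; the class `LookAsideCache` and the `fetch` function are hypothetical and do not describe Baidu's actual setup.

```python
class LookAsideCache:
    """Reader-side logic: check the cache, fall back to remote storage on a miss."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # slow call to the remote data center
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        value = self._fetch_remote(key)
        self._cache[key] = value  # populate the cache for later reads
        return value


remote_calls = []


def fetch(key):
    # Stand-in for a cross-data-center read (hypothetical).
    remote_calls.append(key)
    return f"rows of {key}"


cache = LookAsideCache(fetch)
first = cache.read("table_1")
second = cache.read("table_1")
```

Only the first read crosses to the remote storage; the second is served locally, which is where the speedup comes from.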


Summary
● Tachyon is a quickly improving and developing scalable caching layer, compatible with the Hadoop ecosystem.
● It may be considered in many cases where Spark, MR, Tez and cloud storage are used in the same system.

Questions?
Thanks for your attention!
