October 2013 HUG: GridGain In-Memory Acceleration
Uploaded by: yahoo-developer-network
Category: Technology
Transcript of October 2013 HUG: GridGain In-Memory Acceleration
In-Memory Accelerator for Hadoop™
www.gridgain.com #gridgain
Slide 2 - Hadoop: Pros and Cons
What is Hadoop?
> Hadoop is a batch system
> HDFS - Hadoop Distributed File System
> Data must be ETL-ed into HDFS
> Parallel processing over data in HDFS
> Hive, Pig, HBase, Mahout...
> Most popular data warehouse
Pros:
> Scales very well
> Fault tolerant and resilient
> Very active and rich ecosystem
> Processes TBs/PBs in parallel
Cons:
> Batch oriented - real time not possible
> Complex deployment
> Significant execution overhead
> HDFS is IO and network bound
Slide 3 - In-Memory Accelerator For Hadoop: Overview
Up To 100x Faster:
1. In-Memory File System
   > 100% compatible with HDFS
   > Boost HDFS performance by removing IO overhead
   > Dual-mode: standalone or caching
   > Blend into Hadoop ecosystem
2. In-Memory MapReduce
   > Eliminate Hadoop MapReduce overhead
   > Allow for embedded execution
   > Record-based
Slide 4 - GridGain: In-Memory Computing Platform
Slide 5 - In-Memory Accelerator For Hadoop: Details
> PnP (plug-and-play) Integration
   > Minimal or zero code change
> Any Hadoop distro
   > Hadoop v1 and v2
> In-Memory File System
   > 100% compatible with HDFS
   > Dual-mode: no ETL needed, read/write-through
   > Block-level caching & smart eviction
   > Automatic pre-fetching
   > Background fragmentizer
   > On-heap and off-heap memory utilization
> In-Memory MapReduce
   > In-process co-located computations - access GGFS (GridGain File System) in-process
   > Eliminate unnecessary IPC
   > Eliminate long task startup time
   > Eliminate mandatory sorting and re-shuffling on reduction
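The last point on this slide, dropping the mandatory sort and re-shuffle before reduction, can be illustrated with a small sketch. This is not GridGain's implementation or API: the function names (`map_words`, `reduce_with_sort`, `reduce_with_hash`) and the word-count job are hypothetical, chosen only to contrast a sort-then-group reduce with a sort-free hash aggregation running in a single process.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_words(records):
    """Map phase: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.split():
            yield word, 1


def reduce_with_sort(pairs):
    """Classic style: sort all pairs by key, then reduce each key group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def reduce_with_hash(pairs):
    """Sort-free style: fold each pair into a hash table as it arrives."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical input records, standing in for lines read from a file.
records = ["in memory file system", "in memory mapreduce", "file system cache"]
sorted_result = reduce_with_sort(map_words(records))
hashed_result = reduce_with_hash(map_words(records))
```

Both functions produce the same totals here because summation is associative and commutative, so the order in which pairs reach the reducer does not matter. A job whose reducer depends on seeing keys in sorted order would still need the sort.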
Slide 6 - GridGain Visor: Unified DevOps
HDFS Profiler
File Manager
Slide 7 - Benchmarks: GGFS vs HDFS
10-node cluster of Dell R610 servers
> Each has dual 8-core CPU
> Ubuntu 12.04, Java 7
> 10 GbE network
> Stock Apache Hadoop 2.x
Slide 8 - Comparison: Hadoop Accelerator vs. Spark
Hadoop Accelerator:
> No ETL required
   > Automatic HDFS read-through and write-through
   > Data is loaded on demand
> Per-block file caching
   > Only hot data blocks are in memory
> Strong management capabilities
   > GridGain Visor - Unified DevOps
Spark:
> Requires data to be ETL-ed into Spark
   > Changes to data do not get propagated to HDFS
   > Explicit ETL step consumes time
> Needs to have the full file loaded
   > If it does not fit, it gets offloaded to disk
> No management capabilities
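The read-through, write-through, and per-block caching behaviour listed for the Hadoop Accelerator can be sketched as follows. This is a minimal illustration under stated assumptions, not GGFS code: the `BlockCache` class is hypothetical, a plain dictionary stands in for HDFS, and least-recently-used eviction stands in for whatever "smart eviction" policy the product actually uses.

```python
from collections import OrderedDict


class BlockCache:
    """Hypothetical LRU cache of fixed-size blocks in front of a slower store."""

    def __init__(self, backing, block_size, max_blocks):
        self.backing = backing        # dict: path -> bytearray (stand-in for HDFS)
        self.block_size = block_size
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()   # (path, block index) -> bytes, oldest first
        self.misses = 0

    def read_block(self, path, index):
        """Read-through: serve from memory, loading from the store on a miss."""
        key = (path, index)
        if key in self.blocks:
            self.blocks.move_to_end(key)   # mark as most recently used
            return self.blocks[key]
        self.misses += 1
        start = index * self.block_size
        data = bytes(self.backing[path][start:start + self.block_size])
        self._store(key, data)
        return data

    def write_block(self, path, index, data):
        """Write-through: update the backing store first, then the cached block."""
        start = index * self.block_size
        buf = self.backing.setdefault(path, bytearray())
        if len(buf) < start:
            buf.extend(bytes(start - len(buf)))   # zero-pad any gap
        buf[start:start + len(data)] = data
        self._store((path, index), bytes(buf[start:start + self.block_size]))

    def _store(self, key, data):
        self.blocks[key] = data
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)       # evict least recently used


# Hypothetical file of three 4-byte blocks; only two blocks fit in memory.
hdfs = {"/logs/a": bytearray(b"AAAABBBBCCCC")}
cache = BlockCache(hdfs, block_size=4, max_blocks=2)
first = cache.read_block("/logs/a", 0)     # miss: loaded on demand
again = cache.read_block("/logs/a", 0)     # hit: served from memory
cache.read_block("/logs/a", 1)             # miss
cache.read_block("/logs/a", 2)             # miss: evicts block 0
cache.write_block("/logs/a", 0, b"ZZZZ")   # change reaches the backing store
```

Because every write goes to the backing store before the cache, the store never lags behind the in-memory copy, which is the property the slide contrasts with an explicit ETL step.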
Slide 9 - Customer Use Case: Task & Challenge
Task:
> Real time search with MapReduce
> Dataset size is 5 TB
> Writes 80%, reads 20%
> Perceptual real time SLA (few seconds)
Challenge:
> Hadoop MapReduce too slow (> 30 sec)
> Data scanning slow due to constant IO
> Overall job takes > 1 minute
Slide 10 - Customer Use Case: Solution
> Utilize existing servers
   > Start GridGain data node on every server
> Only put highly utilized files in GGFS
   > User controlled caching
> In-Memory MapReduce over GGFS
   > Embedded processing
> Results under 3 seconds
GridGain Systems
www.gridgain.com
1065 East Hillsdale Blvd, Suite 230
Foster City, CA 94404
@gridgain