In-Memory Computing, Storage & Analysis: Apache Apex + Apache Geode

Posted on 20-Mar-2017


Sandeep Deshmukh, Ashish Tadose

Project Status

Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks

Apex in Apache Incubation Stage

Apache Apex (Incubating) Committer List

Open-sourced in July 2015

Over 50 committers already… and growing…

Apex Platform Overview: Enterprise Edition

Directed Acyclic Graph (DAG)

Application Programming Model

• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
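The programming model above can be illustrated with a small, single-process sketch in plain Java. It is not the Apex API: the `DagSketch` class, its `Operator` interface and the `connect` helper are hypothetical names chosen for illustration, and the sketch ignores parallel operator instances and distribution across containers. It only shows operators consuming tuples from an input stream and emitting tuples to an output stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

// Toy illustration only: these types are hypothetical and are NOT the Apex API.
class DagSketch {

    /** An operator consumes one tuple at a time and may emit tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Connects two operators with a stream: every tuple `up` emits is fed to `down`. */
    static <A, B, C> Operator<A, C> connect(Operator<A, B> up, Operator<B, C> down) {
        return (tuple, output) -> up.process(tuple, mid -> down.process(mid, output));
    }

    /** Runs a linear DAG: input stream -> splitter -> filter -> upper-caser -> output stream. */
    static List<String> run(List<String> inputStream) {
        // Emits one tuple per whitespace-separated word.
        Operator<String, String> splitter = (line, out) -> {
            for (String word : line.split("\\s+")) {
                out.accept(word);
            }
        };
        // Drops empty tuples (produced when a line starts with whitespace).
        Operator<String, String> filter = (word, out) -> {
            if (!word.isEmpty()) {
                out.accept(word);
            }
        };
        // Transforms each tuple.
        Operator<String, String> upper = (word, out) -> out.accept(word.toUpperCase(Locale.ROOT));

        Operator<String, String> dag = connect(connect(splitter, filter), upper);

        List<String> outputStream = new ArrayList<>();
        for (String tuple : inputStream) {
            dag.process(tuple, outputStream::add);
        }
        return outputStream;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("apex streams tuples", "geode stores them")));
    }
}
```

In a real deployment each operator would run as one or more single-threaded instances inside separate containers, and the streams between them would cross thread or process boundaries rather than being direct method calls.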

[Figure: Application Programming Model. Operators connected by streams of tuples, ending in an output stream.]

Apex Component Overview

[Figure: architecture diagram. Labels: Hadoop Edge Node (DT RTS Management Server, CLI, REST API); Hadoop Nodes with YARN Containers (Apex App Master; Streaming Containers running Op1, Op2, Op3 on Thread 1 … Thread N); legend: Part of Community Edition.]

Apex Features…

• Native Hadoop integration
• Partitioning and scaling out
• Advanced windowing support
• Stateful fault tolerance
• Processing semantics
• Compute locality
• Dynamic updates

Apache Apex-Malhar

IMC Components in Apex

• Processing data in motion
• Preventing data loss: buffer server
• In-memory data stores for querying data

Why In-Memory Computing?

[Figure: Typical latencies]

In-memory computing will have a long-term, disruptive impact by radically changing users' expectations, application design principles, products' architectures and vendors' strategies.

RAM is the new disk, disk the new tape.

In-memory computing is the future of computing. It offers massive benefits not only in TCO reduction but across all four value dimensions: performance, process innovation, simplification and flexibility.

In-Memory Data Grid (IMDG)

What are IMDGs?
• IMDGs host data in memory and distribute it across a cluster of commodity servers
• The main access patterns are key/value access, MapReduce, various forms of HPC-like processing, and limited distributed querying and indexing

Why are they important?
• Performance: using RAM is faster than using disk
• Extremely high availability of data: it is kept in memory and in a highly distributed cluster
• Data structure: a key/value store allows greater flexibility for the application developer; the object store is similar in interface to a typical concurrent hash map
• Scalable data partitioning
• Transactional ACID support
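The idea of a key/value store with a hash-map-like interface, partitioned across servers, can be sketched in plain Java. This is a toy, in-process stand-in and not Geode or any other real IMDG: `ToyDataGrid`, `ownerOf` and `sizeOf` are hypothetical names, each "server" is just a `ConcurrentHashMap`, and the modulo partitioning scheme is an assumption made for simplicity. Redundancy, networking and transactions are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy illustration only: a hypothetical in-process "grid", not Geode or any real IMDG.
class ToyDataGrid<K, V> {
    // One map per simulated server; a real grid would place these on different machines.
    private final List<ConcurrentMap<K, V>> servers = new ArrayList<>();

    ToyDataGrid(int serverCount) {
        if (serverCount <= 0) {
            throw new IllegalArgumentException("serverCount must be positive");
        }
        for (int i = 0; i < serverCount; i++) {
            servers.add(new ConcurrentHashMap<>());
        }
    }

    /** Picks the server that owns a key by hashing it (simple modulo partitioning). */
    int ownerOf(K key) {
        return Math.floorMod(key.hashCode(), servers.size());
    }

    /** Stores a value on the owning server; returns the previous value, if any. */
    V put(K key, V value) {
        return servers.get(ownerOf(key)).put(key, value);
    }

    /** Reads a value from the owning server; returns null when the key is absent. */
    V get(K key) {
        return servers.get(ownerOf(key)).get(key);
    }

    /** Number of entries held by one simulated server. */
    int sizeOf(int server) {
        return servers.get(server).size();
    }

    public static void main(String[] args) {
        ToyDataGrid<Integer, String> grid = new ToyDataGrid<>(3);
        for (int id = 0; id < 9; id++) {
            grid.put(id, "order-" + id);
        }
        System.out.println(grid.get(4) + " lives on server " + grid.ownerOf(4));
    }
}
```

The caller sees only `put` and `get`, much like a concurrent hash map, while the grid decides which server holds each entry. Adding servers spreads the same key space over more memory, which is the basis of the scalable partitioning mentioned above.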

High-Level Architecture: Geode

Geode Features

Core Features
• Linear scalability & latency-minimizing data distribution
• Performance-optimized persistence: high availability & durability
• Configurable consistency: region types {partitioned, replicated & local}
• Distributed transactions
• Cluster resilience & failover

Advanced Features
• Server function execution: send computation to data
• Asynchronous events: deliver events to a receiver without impacting the write path
• Continuous queries & client subscriptions: useful for refreshing the client cache


Application Patterns

• Caching for speed and scale: read-through, write-through, write-behind
• Geode as the OLTP system of record: data in memory for low latency, on disk for durability
• Parallel compute engine
• Real-time analytics
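The three caching patterns named above differ only in when the backing store is touched. The following plain-Java sketch shows that difference. It is not the Geode API: `CachePatternsSketch` and its method names are hypothetical, a `Map` stands in for the database, and the class is single-threaded for brevity.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Toy illustration only: hypothetical class, not the Geode API, and not thread-safe.
class CachePatternsSketch<K, V> {
    private final Map<K, V> cache = new HashMap<>();
    private final Map<K, V> backingStore;               // stands in for a database
    private final Queue<K> dirtyKeys = new ArrayDeque<>();
    int storeReads = 0;                                  // counts trips to the backing store

    CachePatternsSketch(Map<K, V> backingStore) {
        this.backingStore = backingStore;
    }

    /** Read-through: on a miss, load from the backing store and keep the value cached. */
    V get(K key) {
        V cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        storeReads++;
        V loaded = backingStore.get(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    /** Write-through: update the cache and the backing store in the same call. */
    void putWriteThrough(K key, V value) {
        cache.put(key, value);
        backingStore.put(key, value);
    }

    /** Write-behind: update the cache now and queue the key for a later store write. */
    void putWriteBehind(K key, V value) {
        cache.put(key, value);
        dirtyKeys.add(key);
    }

    /** Drains the write-behind queue; a real system would do this asynchronously. */
    int flush() {
        int written = 0;
        K key;
        while ((key = dirtyKeys.poll()) != null) {
            backingStore.put(key, cache.get(key));
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        Map<String, String> database = new HashMap<>();
        database.put("user:1", "Ada");
        CachePatternsSketch<String, String> cache = new CachePatternsSketch<>(database);
        System.out.println(cache.get("user:1"));             // loaded from the store, then cached
        cache.putWriteBehind("user:2", "Grace");
        System.out.println(database.containsKey("user:2"));  // store is updated only on flush
        cache.flush();
        System.out.println(database.get("user:2"));
    }
}
```

Write-through keeps the store consistent at the cost of a slower write; write-behind keeps the write path fast and accepts that the store lags until the queue is drained.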

Geode Reads with Consistent Latency and CPU

• Scaled from 256 clients and 2 servers to 1280 clients and 10 servers
• Partitioned region with redundancy and 1K data size

[Figure: speedup, latency (ms) and CPU % plotted against the number of server hosts (2 to 10).]

Geode 3.5-4.5X Faster Than Cassandra for YCSB

Roadmap

• HDFS persistence
• Off-heap storage
• Lucene indexes
• Spark integration
• Cloud Foundry service

…and other ideas from the Geode community!

Streaming Meets In-Memory Data Grid

Apex + Geode

Apex operator checkpointing in Geode store
• Better latency for checkpoint operations than HDFS checkpointing
• Makes the Apex DAG a complete in-memory pipeline
• https://issues.apache.org/jira/browse/APEXCORE-283

Write Apex data streams to Geode store
• Apex output operator implementation which writes data to a Geode region
• Use cases
  • Ingest streaming data into Geode for further processing
  • Store data processed by the Apex pipeline in the Geode store to serve user queries
• https://malhar.atlassian.net/projects/MLHR/issues/MLHR-1942
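The output-operator idea can be sketched without either library. The class below is hypothetical and uses neither the Apex nor the Geode API: a plain `Map` stands in for the Geode region, and the `beginWindow` / `process` / `endWindow` callbacks are an assumed shape for a windowed operator, chosen for illustration. It buffers the tuples of one streaming window and writes them to the store as a single batch when the window ends.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy illustration only: hypothetical class; it does not use the Apex or Geode APIs.
class RegionOutputOperatorSketch<K, V> {
    private final Map<K, V> region;                           // stands in for a Geode region
    private final Map<K, V> windowBuffer = new LinkedHashMap<>();

    RegionOutputOperatorSketch(Map<K, V> region) {
        this.region = region;
    }

    /** Called when a streaming window starts: begin with an empty buffer. */
    void beginWindow(long windowId) {
        windowBuffer.clear();
    }

    /** Called once per incoming tuple; a later value for a key overwrites an earlier one. */
    void process(K key, V value) {
        windowBuffer.put(key, value);
    }

    /** Called when the window ends: write the whole batch to the store in one call. */
    int endWindow() {
        region.putAll(windowBuffer);
        int written = windowBuffer.size();
        windowBuffer.clear();
        return written;
    }

    public static void main(String[] args) {
        Map<String, Double> region = new ConcurrentHashMap<>();
        RegionOutputOperatorSketch<String, Double> operator = new RegionOutputOperatorSketch<>(region);
        operator.beginWindow(1L);
        operator.process("sensor-1", 20.5);
        operator.process("sensor-2", 19.0);
        operator.process("sensor-1", 21.0);   // replaces the earlier sensor-1 reading
        System.out.println("entries written: " + operator.endWindow());
        System.out.println(region);
    }
}
```

Batching per window keeps the number of store writes proportional to the number of windows rather than the number of tuples. Whether the actual operator referenced in MLHR-1942 batches this way is not stated in the slides.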

Questions?

Thank You…